VISUAL TIMING FOR SOUND SOURCE DEPTH ESTIMATION IN THE WILD

Abstract

Depth estimation enables a wide variety of 3D applications, such as robotics and autonomous driving. Despite significant work on various depth sensors, it is challenging to develop an all-in-one method to meet multiple basic criteria. In this paper, we propose a novel audio-visual learning scheme by integrating semantic features with physical spatial cues to boost monocular depth with only one microphone. Inspired by the flash-to-bang theory, we develop FBDepth, the first passive audio-visual depth estimation framework. It is based on the difference between the time-of-flight (ToF) of the light and the sound. We formulate sound source depth estimation as an audio-visual event localization task for collision events. To approach decimeter-level depth accuracy, we design a coarse-to-fine pipeline to push the temporary localization accuracy from event-level to millisecond-level by aligning audio-visual correspondence and manipulating optical flow. FBDepth feeds the estimated visual timestamp together with the audio clip and object visual features to regress the source depth. We use a mobile phone to collect 3.6K+ video clips with 24 different objects at up to 60m. FBDepth shows superior performance especially at a long range compared to monocular and stereo methods.

1. INTRODUCTION

Depth estimation is the fundamental functionality to enable 3D perception and manipulation. Although there have been significant efforts on developing depth estimation methods with various sensors, current depth estimation schemes fail to achieve a good balance on multiple basic metrics including accuracy, range, angular resolution, cost, and power consumption. Active depth sensing methods actively emit signals, such as LiDAR (Caesar et al., 2020 ), structuredlight (Zhang, 2012 ), mmWave (Barnes et al., 2020 ), ultrasound (Mao et al., 2016 ), WiFi (Vasisht et al., 2016) . They compare the reflected signal with the reference signal to derive time-of-flight (ToF), phase change, or Doppler shift to estimate the depth. Active methods can achieve high accuracy because of the physical fundamental and well-designed modulated sensing signals. Lidar is the most attractive active senor due to its large sensing range and dense point cloud. However, the density is not sufficient enough to enable a small angular resolution. Therefore, the points are too sparse to be recognized at a long distance. Besides, the prohibitive cost and power consumption limit the availability of Lidar on general sensing devices. Passive depth sensing takes signals from the environment for sensing directly. It commonly uses RGB monocular camera (Bhoi, 2019; Laga et al., 2020 ), stereo camera (Cheng et al., 2020) , thermal camera (Lu & Lu, 2021), or multi-view cameras (Long et al., 2021a) . These sensors can achieve pixel-wise angular resolution and consume pretty less energy due to omitting the signal emitting. Among them, stereo matching can effectively estimate the disparity and infer a dense depth map since it transforms the spatial depth to the visual disparity based on the solid physical law. The baseline of the stereo camera determines the effective range and accuracy. Therefore, the dimension of the stereo camera is placed as the critical trade-off with sensing metrics. Thanks to the advance in deep learning, the cheap monocular depth estimation keeps on improving performance with new network structures and high-quality datasets. However, the accuracy is still not satisfactory especially at a long range because it can only regress depth based on the implicit visual cues. It is ill-posed without any physical formulation. Besides, it heavily relies on the dataset. It requires domain adaption and camera calibration for various camera intrinsics (Li et al., 2022) .

