TIME WILL TELL: NEW OUTLOOKS AND A BASELINE FOR TEMPORAL MULTI-VIEW 3D OBJECT DETECTION

Abstract

While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. Observing that existing works' fusion of multi-frame images constitutes instances of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over a long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long-term and short-term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets a new state of the art on nuScenes, achieving first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set.

1. INTRODUCTION

Recent advances in camera-only 3D detection have alleviated monocular ambiguities by leveraging a short history of observations. Despite their improvements, these outdoor works neglect the majority of past observations, limiting their temporal fusion to a few frames within a short 2-3 second window. These long-term past observations are critical for better depth estimation, which oracle experiments (Wang et al., 2021b; Jing et al., 2022) have shown to be the main bottleneck of camera-only pipelines due to their lack of explicit depth measurements. Although existing methods aggregate temporal features differently, in essence they all consider regions in 3D space and gather image features corresponding to these hypothesis locations from multiple timesteps. They then use this temporal information to determine the occupancy of, or the existence of an object at, those regions. As such, these works are instances of temporal stereo matching. To quantify the quality of multi-view (temporal) depth estimation possible in these methods, we define the localization potential at a 3D location as the magnitude of the change in the source-view projection induced by a change in depth in the reference view. As shown in Figure 1, a larger localization potential causes the depth hypotheses (Yao et al., 2018) for a reference-view pixel to be projected further apart, giving them more distinct source-view features. The correct depth hypothesis, which matches the source-view feature more strongly, can then more easily suppress incorrect depth hypotheses with clearly unrelated features, allowing for more accurate depth estimation. Evaluating the localization potential in driving scenarios, we find that using only a few recent frames heavily limits the localization potential, and thus the depth estimation quality, of existing methods.
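To make the definition concrete, localization potential can be estimated by finite differences: back-project a reference pixel at two nearby depth hypotheses, project both 3D points into the source view, and measure how far apart the projections land per meter of depth. The sketch below is an illustrative NumPy example under our own simplifying assumptions (a pinhole model, a hypothetical `project` helper, and a finite-difference step `ddepth`); it is not the paper's implementation.

```python
import numpy as np

def project(K, T_world_cam, pt_world):
    """Project a 3D world point into a camera with intrinsics K and
    world-to-camera extrinsics T_world_cam (4x4)."""
    pt_cam = T_world_cam[:3, :3] @ pt_world + T_world_cam[:3, 3]
    uv = K @ pt_cam
    return uv[:2] / uv[2]

def localization_potential(K_ref, K_src, T_ref, T_src, pixel, depth, ddepth=0.5):
    """Finite-difference estimate of localization potential: how far the
    source-view projection moves per unit change of depth along the
    reference ray (pixels / meter)."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    ray = np.linalg.inv(K_ref) @ uv1              # viewing ray in the reference camera frame
    T_ref_inv = np.linalg.inv(T_ref)              # camera-to-world for the reference view

    def world_pt(d):
        p_cam = ray * d                           # hypothesis point at depth d
        return T_ref_inv[:3, :3] @ p_cam + T_ref_inv[:3, 3]

    p_a = project(K_src, T_src, world_pt(depth))
    p_b = project(K_src, T_src, world_pt(depth + ddepth))
    return np.linalg.norm(p_b - p_a) / ddepth
```

Consistent with the analysis above, a source frame farther back in time (a longer baseline behind a forward-moving ego vehicle) yields a larger potential for off-center pixels, while a pixel near the epipole gains almost nothing from any temporal difference.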
Distinct from the intuition in both indoor works, which select frames whose translation and rotation exceed a minimum threshold (Hou et al., 2019; Sun et al., 2021), and outdoor works, which often empirically select a single historical frame (Huang & Huang, 2022; Wang et al., 2022c; Liu et al., 2022b), we find that the optimal rotation and temporal difference between the reference and source frames varies significantly across pixels, depths, cameras, and ego-motion. Hence, many timesteps over a long history are necessary so that each pixel and depth has access to a setup that maximizes its localization potential. Further, we find that localization potential is decreased not only by using fewer timesteps but also by the lower image feature resolution used in existing methods. Both factors significantly hinder the benefits of temporal fusion in prior works. We verify our theoretical analysis by designing a model that follows naturally from our findings. Although existing methods' use of low-resolution image feature maps for multi-view stereo limits matching quality, the dramatic increase in localization potential from our proposed long-term temporal fusion can offset this limitation. Our model adopts the coarse but efficient low-resolution feature maps and leverages a 16-frame BEV cost volume. We find that such a framework already outperforms prior art, highlighting the significant gap in how existing literature utilizes temporal information. We extend our model by further exploiting short-term temporal fusion with an efficient sampling module, replacing the monocular depth priors in the 16-frame BEV cost volume with a two-view depth prior. This time, offsetting the temporal decrease in localization potential with an increase in feature-map resolution, we observe a further boost in performance, demonstrating that short-term and long-term temporal fusion are highly complementary.
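The long-term matching step can be sketched as a plane sweep over depth hypotheses: for every reference pixel and candidate depth, sample the corresponding location in each historical feature map and score the match. The NumPy sketch below is a simplified, unoptimized illustration under our own assumptions: `proj_fn` (the geometry mapping a pixel-depth pair into a past frame) is taken as given, sampling is nearest-neighbor, and the cost is a plain dot-product similarity averaged over frames. None of these choices is claimed to match SOLOFusion's actual BEV cost volume.

```python
import numpy as np

def cost_volume(ref_feat, src_feats, proj_fn, depths):
    """Plane-sweep matching costs over a history of frames.

    ref_feat:  (C, H, W) reference feature map
    src_feats: list of (C, H, W) historical feature maps
    proj_fn:   proj_fn(t, u, v, d) -> (u', v'), the (assumed given) projection
               of reference pixel (u, v) at depth d into source frame t
    depths:    (D,) candidate depth hypotheses
    Returns a (D, H, W) cost volume; a softmax over D would give a
    per-pixel depth distribution.
    """
    C, H, W = ref_feat.shape
    D = len(depths)
    cost = np.zeros((D, H, W))
    for v in range(H):
        for u in range(W):
            f_ref = ref_feat[:, v, u]
            for di, d in enumerate(depths):
                sims = []
                for t, feat in enumerate(src_feats):
                    us, vs = proj_fn(t, u, v, d)
                    ui, vi = int(round(us)), int(round(vs))
                    if 0 <= ui < W and 0 <= vi < H:   # nearest-neighbor sampling, in-bounds only
                        sims.append(f_ref @ feat[:, vi, ui])
                cost[di, v, u] = np.mean(sims) if sims else 0.0
    return cost
```

The same structure applies to both regimes discussed above: short-term fusion uses few frames at high feature resolution, while long-term fusion uses many frames at the coarse but efficient resolution.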
Our main contributions are as follows:

• We define localization potential to measure the ease of multi-view depth estimation and use it to theoretically and empirically demonstrate that the optimal rotation and temporal difference between reference and source cameras for multi-view stereo varies significantly over pixels and depths. This runs contrary to the intuition in existing works, which impose a minimum view-change threshold or empirically search for a single past frame to fuse.

• We verify our theoretical analysis by designing a model, SOLOFusion, that leverages both ShOrt-term, high-resolution and LOng-term, low-resolution temporal stereo for depth estimation. Critically, to the best of our knowledge, we are the first to balance the impacts of spatial resolution and temporal difference on localization potential and to use this balance to design an efficient but strong temporal multi-view 3D detector for autonomous driving.

(…et al., 2017; Brazil & Liu, 2019; Qin et al., 2019; Xu & Chen, 2018; Zhou et al., 2019). Some works leverage CAD models (Liu et al., 2021; Manhardt et al., 2019; Barabanau et al., 2020), while others set prediction targets as keypoints (Li et al., 2022e; Zhang et al., 2021) or disentangled 3D parameters (Simonelli et al., 2019; Wang et al., 2021a). Another line of work predicts in 3D, using monocular depth prediction networks (Fu et al., 2018; Godard et al., 2017) to generate pseudo-LiDAR (Wang et al., 2019; Weng & Kitani, 2019) and applying LiDAR-based 3D detection frameworks. Our paper addresses monocular 3D ambiguity through temporal fusion and is orthogonal to these directions.



Figure 1: Depth hypothesis projections onto the t = T-16 source view lie further apart than those onto the t = T-1 source view, making multi-view depth estimation easier.

• Our framework significantly outperforms state-of-the-art methods in utilizing temporal information, demonstrating considerable improvements in mAP and mATE over a strong non-temporal baseline, as shown in Figure 2. SOLOFusion achieves first place on the nuScenes test set and outperforms the previous best method by 5.2% mAP and 3.7% NDS on the validation set.

Code is available at https://github.com/Divadi/SOLOFusion.

