PATHFUSION: PATH-CONSISTENT LIDAR-CAMERA DEEP FEATURE FUSION

Abstract

Fusing camera data with LiDAR is a promising technique for improving the accuracy of 3D detection, owing to the complementary physical properties of the two sensors. While most existing methods fuse camera features directly with raw LiDAR point clouds or shallow 3D features, we observe that direct deep 3D feature fusion achieves inferior accuracy due to feature mis-alignment. The mis-alignment, which originates from feature aggregation across large receptive fields, becomes increasingly severe at deeper network stages. In this paper, we propose PathFusion to enable path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path consistency loss between shallow and deep features, which encourages the 2D backbone and its fusion path to transform 2D features in a way that is semantically aligned with the transform of the 3D backbone. We apply PathFusion to the prior-art fusion baseline, Focals Conv, and observe consistent improvements of more than 1.2% mAP on the nuScenes test split, both with and without test-time augmentation. Moreover, PathFusion also improves KITTI AP 3D (R11) by more than 0.6% at the moderate difficulty level.

1. INTRODUCTION

LiDAR sensors and cameras are widely used in autonomous driving to provide complementary information for 3D detection (e.g., Caesar et al., 2020; Sun et al., 2020; Geiger et al., 2013). LiDAR point clouds capture geometry well but suffer from low resolution due to power and hardware limitations. By contrast, cameras capture dense, colored images with richer semantic information but usually lack the shape and depth cues needed for geometry reasoning. As a result, recent methods (Li et al., 2022c; Vora et al., 2020; Liang et al., 2022; Li et al., 2022b) propose to fuse 2D and 3D features from LiDAR point clouds and camera images to enable accurate and robust 3D detection.

LiDAR-camera fusion requires projecting the features into the same space. Previous works either lift 2D camera features into 3D space (e.g., Vora et al., 2020; Wang et al., 2021a; Li et al., 2022b; Chen et al., 2022c;a) or align both 2D and 3D features in a common representation space such as bird's-eye view (BEV) (e.g., Liu et al., 2022c; Liang et al., 2022). One important design question for feature fusion is the choice of fusion stage; we illustrate popular choices in Figure 1. Shallow fusion in Figure 1(a) (Chen et al., 2022a; Vora et al., 2020; Chen et al., 2022d) fuses camera features directly with raw LiDAR points or shallow LiDAR features. Although shallow fusion benefits from the LiDAR-camera calibration for better feature alignment, the camera features are forced to pass through multiple modules specialized for 3D feature extraction, e.g., voxelization, rather than 2D camera processing (Li et al., 2022b). By contrast, deep fusion (Xu et al., 2021b; Li et al., 2022b) in Figure 1(b) enables more dedicated LiDAR and camera feature processing.
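To make the shallow-fusion setup concrete, the following sketch shows the standard calibration-based alignment it relies on: each raw LiDAR point is projected into the image plane via the camera intrinsics and extrinsics, and the camera feature at the projected pixel is appended to the point. This is only an illustrative sketch of point-level fusion in general (function names, array layouts, and nearest-pixel sampling are our own assumptions, not the implementation of any cited method).

```python
import numpy as np

def shallow_fuse(points, img_feats, K, T_lidar_to_cam):
    """Illustrative point-level (shallow) fusion sketch.

    points:          (N, 3) LiDAR xyz coordinates
    img_feats:       (H, W, C) camera feature map
    K:               (3, 3) camera intrinsics
    T_lidar_to_cam:  (4, 4) LiDAR-to-camera extrinsics
    Returns (N, 3 + C): each point decorated with the camera feature
    at its projected pixel (zeros for points outside the image).
    """
    H, W, C = img_feats.shape
    # Project LiDAR points into the camera frame (homogeneous coordinates).
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T_lidar_to_cam @ pts_h.T).T[:, :3]
    # Pinhole projection to pixel coordinates.
    uvz = (K @ cam.T).T
    u = uvz[:, 0] / uvz[:, 2]
    v = uvz[:, 1] / uvz[:, 2]
    # Keep only points in front of the camera and inside the image.
    valid = (uvz[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    fused = np.zeros((len(points), C))
    ui = np.clip(u[valid].astype(int), 0, W - 1)
    vi = np.clip(v[valid].astype(int), 0, H - 1)
    fused[valid] = img_feats[vi, ui]  # nearest-pixel feature gather
    return np.concatenate([points, fused], axis=1)
```

Because this projection happens before voxelization, each point maps to exactly one pixel and the alignment is unambiguous; the ambiguity described next arises only once features are aggregated over many points and voxels.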
However, since multiple LiDAR points are usually voxelized together and further aggregated with neighboring voxels before fusion, one voxel may correspond to many camera features, making the feature alignment highly ambiguous. This feature mis-alignment significantly degrades the overall network accuracy and forces the majority of existing works to choose shallow fusion (Chen et al., 2022a). In this paper, we propose PathFusion to augment existing shallow fusion methods and enable fusing deep LiDAR and camera features. PathFusion introduces a novel path consistency loss to regularize the 2D feature extraction and projection and explicitly encourages the alignment between LiDAR

