PATHFUSION: PATH-CONSISTENT LIDAR-CAMERA DEEP FEATURE FUSION

Abstract

Fusing camera data with LiDAR is a promising technique to improve the accuracy of 3D detection due to their complementary physical properties. While most existing methods focus on fusing camera features directly with raw LiDAR point clouds or shallow 3D features, it is observed that direct deep 3D feature fusion achieves inferior accuracy due to feature mis-alignment. The mis-alignment, which originates from feature aggregation across large receptive fields, becomes increasingly severe for deep network stages. In this paper, we propose PathFusion to enable path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path consistency loss between shallow and deep features, which encourages the 2D backbone and its fusion path to transform 2D features in a way that is semantically aligned with the transform of the 3D backbone. We apply PathFusion to the prior-art fusion baseline, Focals Conv, and observe more than 1.2% mAP improvement on the nuScenes test split, consistently with and without test-time augmentation. Moreover, PathFusion also improves KITTI AP 3D (R11) by more than 0.6% on the moderate level.

1. INTRODUCTION

LiDARs and cameras are widely used in autonomous driving to provide complementary information for 3D detection (e.g., Caesar et al., 2020; Sun et al., 2020; Geiger et al., 2013). LiDAR point clouds capture better geometry information but suffer from low resolution due to power and hardware limitations. By contrast, cameras capture dense, colored images with richer semantic information but usually lack the shape and depth information needed for geometric reasoning. As a result, recent methods (Li et al., 2022c; Vora et al., 2020; Liang et al., 2022; Li et al., 2022b) propose to fuse 2D and 3D features from LiDAR point clouds and camera images to enable accurate and robust 3D detection. LiDAR-camera fusion requires projecting the features into the same space. Previous works have proposed to either lift 2D camera features into 3D space (e.g., Vora et al., 2020; Wang et al., 2021a; Li et al., 2022b; Chen et al., 2022c; a) or to align both 2D and 3D features in a common representation space such as bird's-eye-view (BEV) (e.g., Liu et al., 2022c; Liang et al., 2022). One important question for feature fusion is selecting the correct fusion stage, and we illustrate popular choices in Figure 1. Shallow fusion in Figure 1 (a) (Chen et al., 2022a; Vora et al., 2020; Chen et al., 2022d) fuses camera features directly with raw LiDAR points or shallow LiDAR features. Although shallow fusion benefits from the LiDAR-camera calibration for better feature alignment, the camera features are forced to pass through multiple modules that are specialized for 3D feature extraction rather than 2D camera processing (Li et al., 2022b), e.g., voxelization. By contrast, deep fusion (Xu et al., 2021b; Li et al., 2022b) in Figure 1 (b) enables more dedicated LiDAR and camera feature processing.
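The calibration-based alignment that shallow fusion relies on can be sketched as follows: each LiDAR point is projected into the image plane with the extrinsic and intrinsic calibration matrices, and the point is then "decorated" with the image feature at its projected pixel. This is a minimal illustration, not the implementation of any particular method; the function names, the nearest-pixel sampling, and the matrix shapes are assumptions for clarity.

```python
import numpy as np

def project_points_to_image(points_xyz, lidar_to_cam, cam_intrinsics):
    """Project LiDAR points (N, 3) into pixel coordinates (N, 2).

    lidar_to_cam: (4, 4) extrinsic matrix; cam_intrinsics: (3, 3).
    Both come from sensor calibration; the names here are illustrative.
    """
    n = points_xyz.shape[0]
    # Homogeneous coordinates, transformed into the camera frame.
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # (N, 4)
    pts_cam = (lidar_to_cam @ pts_h.T).T[:, :3]        # (N, 3)
    # Perspective projection with the camera intrinsics.
    uvw = (cam_intrinsics @ pts_cam.T).T               # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                      # divide by depth
    return uv, pts_cam[:, 2]                           # pixels and depths

def decorate_points(points_xyz, image_feats, lidar_to_cam, cam_intrinsics):
    """Append the nearest-pixel image feature to each LiDAR point.

    image_feats: (H, W, C) feature map from the 2D backbone.
    Real methods typically use bilinear sampling; nearest-pixel
    lookup is used here to keep the sketch short.
    """
    uv, depth = project_points_to_image(points_xyz, lidar_to_cam,
                                        cam_intrinsics)
    h, w, _ = image_feats.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    valid = depth > 0  # points behind the camera get a zero feature
    feats = image_feats[v, u] * valid[:, None]
    return np.hstack([points_xyz, feats])              # (N, 3 + C)
```

Because this projection happens before any 3D aggregation, the point-to-pixel correspondence is exact, which is precisely the alignment advantage of shallow fusion described above; once voxels aggregate many points, the correspondence becomes one-to-many.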
However, since multiple LiDAR points are usually voxelized together and further aggregated with neighboring voxels before fusion, one voxel may correspond to a number of camera features, making the feature alignment ambiguous. This feature mis-alignment significantly degrades the overall network accuracy and forces the majority of existing works to choose shallow fusion (Chen et al., 2022a). In this paper, we propose PathFusion to augment existing shallow fusion methods and enable the fusion of deep LiDAR and camera features. PathFusion introduces a novel path consistency loss that regularizes the 2D feature extraction and projection and explicitly encourages alignment between deep LiDAR and camera features.

The rest of the paper is organized as follows. Section 2 introduces related work and Section 3 describes the background of 2D and 3D feature fusion. Section 4 provides a motivating example of the challenge of deep feature fusion, and we present our method in Section 5. Section 6 summarizes our results.
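To make the idea of a consistency loss between fusion paths concrete, the sketch below compares deep LiDAR voxel features with camera features carried along the fusion path to the same voxels, penalizing their cosine distance. The exact form of the PathFusion loss is defined later in the paper; the cosine-distance formulation, the function name, and the assumption of pre-matched voxel/feature pairs are all illustrative.

```python
import numpy as np

def path_consistency_loss(deep_lidar_feats, projected_cam_feats):
    """Toy consistency loss between deep LiDAR voxel features (M, C)
    and camera features projected to the same M voxels (M, C).

    Encourages the 2D backbone and fusion path to transform camera
    features in step with the 3D backbone's transform; the cosine
    distance used here is an assumption for illustration.
    """
    eps = 1e-8  # avoid division by zero for all-zero features
    a = deep_lidar_feats / (
        np.linalg.norm(deep_lidar_feats, axis=1, keepdims=True) + eps)
    b = projected_cam_feats / (
        np.linalg.norm(projected_cam_feats, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)       # per-voxel cosine similarity
    return float(np.mean(1.0 - cos))  # 0 when the two paths agree
```

A loss of this shape is minimized when the two feature paths agree direction-wise at every voxel, which is the semantic-alignment behavior the path consistency loss is meant to encourage.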

2. RELATED WORK

3D Object Detection with LiDAR or Camera 3D object detection aims to predict 3D bounding boxes and can be conducted on 3D point clouds from LiDARs or 2D images from cameras. LiDAR-based methods mainly encode the raw LiDAR point clouds (Qi et al., 2017a; b) or voxelize them into sparse voxels (Zhou & Tuzel, 2018) to process the input as multi-resolution features. Based on these 3D features, various single-stage and two-stage 3D detection heads (Shi et al., 2020; 2019; Deng et al., 2021; Bhattacharyya & Czarnecki, 2020; Yan et al., 2018; Yang et al., 2020; He et al., 2020; Zheng et al., 2021; Lang et al., 2019; Yin et al., 2021a; Shi et al., 2021; Wang et al., 2020) are proposed to predict the 3D bounding boxes of target objects. Another series of works focuses on camera-based methods, which encode single-view or multi-view images with 2D backbones such as ResNet (He et al., 2016). Due to the lack of depth information, 2D-to-3D detection heads are devised to enhance 2D features with implicitly or explicitly predicted depth and generate the 3D bounding boxes (Liu et al., 2022a; b; Wang et al., 2021c; Huang et al., 2021; Reading et al., 2021; Xie et al., 2022; Huang & Huang, 2022; Li et al., 2022c; Wang et al., 2021b).

LiDAR-Camera Fusion Because of the complementary properties of LiDARs and cameras, recent methods propose to jointly optimize both modalities and achieve superior accuracy compared to LiDAR-only or camera-only methods.
As shown in Figure 1 (a) and (b), these methods can be largely divided into two categories depending on the fusion stage: (a) shallow fusion decorates the point clouds or shallow LiDAR features with image features to enrich the LiDAR inputs with image semantic priors (Vora et al., 2020; Yin et al., 2021b; Xu et al., 2021a; Chen et al., 2022c; d; Wu et al., 2022; Li et al., 2022a; Chen et al., 2022a; b; Liang et al., 2018a); (b) deep fusion lifts image features into 3D space and combines them in the middle or deep stages of the backbone (Liang et al., 2018b; Huang et al., 2020). Shallow fusion methods, e.g., Focals Conv (Chen et al., 2022a) and LargeKernel3D (Chen et al., 2022b), have achieved state-of-the-art accuracy, while deep fusion models suffer from increasingly severe mis-alignment between camera and LiDAR features. Recently, TransFusion (Bai et al., 2022) and DeepFusion (Li et al., 2022b) propose to align the LiDAR and camera features with transformers, leveraging cross-attention to dynamically capture the correlations between image and LiDAR features.
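The cross-attention alignment used by these transformer-based methods can be sketched in a few lines: LiDAR features act as queries and attend over camera features to aggregate soft, learned correspondences instead of relying on a fixed projection. This single-head sketch omits the learned Q/K/V projection weights that real implementations such as TransFusion include; the function name and shapes are assumptions.

```python
import numpy as np

def cross_attention(lidar_q, cam_kv, scale=None):
    """Single-head cross-attention: LiDAR features (M, C) query camera
    features (N, C) and return aggregated camera context (M, C).

    Learned Q/K/V projections are omitted for brevity; real models
    apply them before this dot-product step.
    """
    c = lidar_q.shape[1]
    scale = scale if scale is not None else 1.0 / np.sqrt(c)
    logits = lidar_q @ cam_kv.T * scale          # (M, N) attention logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over camera tokens
    return attn @ cam_kv                         # (M, C) fused features
```

Because the attention weights are data-dependent, each LiDAR feature can pull from whichever camera features correlate with it, which is how these methods soften the rigid one-voxel-to-many-pixels correspondence problem described above.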



Figure 1: Overview of three different strategies to fuse the camera and LiDAR features: (a) shallow fusion accurately fuses the 2D feature to the shallow 3D feature; (b) deep fusion projects the 2D features to the 3D feature space; (c) our method proposes path consistency loss to mitigate the feature mis-alignment problem and enables augmentation of shallow fusion methods with deep fusion for better accuracy.

