PA-LOFTR: LOCAL FEATURE MATCHING WITH 3D POSITION-AWARE TRANSFORMER

Abstract

We propose a novel image feature matching method that uses 3D position information to augment feature representations with a deep neural network. The proposed method introduces 3D position embeddings into a state-of-the-art feature matcher, LoFTR, and achieves more promising performance. Following the coarse-to-fine matching pipeline of LoFTR, we construct a Transformer-based neural network that generates dense pixel-wise matches. Instead of using 2D position embeddings for the Transformer, the proposed method generates 3D position embeddings that can precisely capture the positional correspondence of matches between images. Importantly, to guide the neural network to learn 3D space information, we augment features with depth information generated by a depth predictor. In this way, our method, PA-LoFTR, can generate 3D position-aware local feature descriptors with a Transformer. We experiment on indoor datasets, and the results show that PA-LoFTR improves feature matching performance compared to state-of-the-art methods.

1. INTRODUCTION

Finding feature matches between images is an important task in many computer vision applications, including camera calibration, structure from motion (SfM), visual localization, simultaneous localization and mapping (SLAM), stereo matching, etc. Generally, the local feature matching problem is solved in three stages: feature detection, feature description and feature matching, and most existing methods follow this pipeline in sequence. The feature detector narrows the focus to a set of interest points on the images. The feature description stage generates a corresponding descriptor for each interest point. Finally, correspondences between the interest points of the images are found by a feature matching algorithm. While current methods extract features based on visual information from images, they pay little attention to the 3D position information of features, which limits matching performance. With the development of deep learning, a number of works have introduced deep architectures to image feature matching. Some methods develop deep networks for feature detection, while others focus on feature description or matching with a learning process. Recently, several works have developed detector-free deep architectures that generate dense matches (Rocco et al. (2018); Li et al. (2020); Rocco et al. (2020)). Detector-free methods provide pixel-wise matches, and the feature detector can be dropped because each pixel is treated as a potential interest point. Several detector-free methods achieve better performance on images with poor texture, complex patterns and illumination, or large viewpoint changes, where a feature detector cannot determine enough interest points following the classic pipeline. Some methods, such as LoFTR (Sun et al., 2021) and COTR (Jiang et al., 2021), introduce the Transformer into image feature matching, giving deep architectures a global receptive field.
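For reference, the matching stage of the classic pipeline can be as simple as a mutual nearest-neighbour check over descriptors. The NumPy sketch below is illustrative only (it is not part of any method discussed here) and assumes L2-normalized descriptors:

```python
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """Match two sets of L2-normalized descriptors by mutual nearest neighbour.

    desc_a: (N, D) descriptors from image A
    desc_b: (M, D) descriptors from image B
    Returns a (K, 2) array of index pairs (i, j) that are each other's
    nearest neighbour in descriptor space.
    """
    # Pairwise similarity; for normalized descriptors, higher means closer.
    sim = desc_a @ desc_b.T                        # (N, M)
    nn_ab = sim.argmax(axis=1)                     # best j for each i
    nn_ba = sim.argmax(axis=0)                     # best i for each j
    # Keep only pairs where the choice is mutual.
    mutual = nn_ba[nn_ab] == np.arange(len(desc_a))
    return np.stack([np.nonzero(mutual)[0], nn_ab[mutual]], axis=1)
```

Detector-free matchers replace this hard nearest-neighbour step with differentiable assignment over dense feature maps, but the underlying goal, one-to-one correspondence between descriptors, is the same.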
Taking advantage of the ability to relate two features anywhere in the images, these methods achieve state-of-the-art performance. However, detector-free methods that generate dense matches can still produce false correspondences under challenging pose changes or repetitive patterns. Although Transformer-based architectures can learn global relations between features, current methods still work with local visual features extracted directly from images, and most pay little attention to the 3D space information of feature points. Based on the observations above, we propose the 3D Position-Aware Local Feature Transformer (PA-LoFTR), a novel approach that encodes space information for local feature matching. We follow the coarse-to-fine structure of LoFTR to construct the deep architecture. Inspired by the use of position encoding in Transformers, we develop a 3D position embedding generator that encodes a 3D point cloud for each pixel instead of encoding 2D pixel coordinates. While the self- and cross-attention layers learn relations between local visual features, we add 3D position embeddings at each encoder layer in the Transformer to boost the learning process. Additionally, we construct a depth predictor branch that learns the depth distribution of images, which further helps the model learn space information. By encoding 3D position information and co-relating it with depth features, PA-LoFTR learns meaningful features containing both visual and 3D space information, which greatly helps determine precise image matches. We evaluate the proposed architecture on an indoor dataset for camera pose estimation and test the effectiveness of the 3D position embedding on stereo matching tasks. The experiments show that PA-LoFTR provides high-quality matches under challenging scenarios and achieves state-of-the-art performance on some tasks.
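The exact formulation of the 3D position embedding is not given here; one plausible sketch is to back-project each pixel to a 3D point using the predicted depth and the camera intrinsics, then apply a per-axis sinusoidal encoding analogous to the 2D positional encoding in LoFTR. Both the unprojection convention and the frequency schedule below are assumptions:

```python
import numpy as np

def unproject_pixels(depth, K):
    """Back-project a dense depth map to per-pixel 3D points (camera frame).

    depth: (H, W) predicted depth map
    K:     (3, 3) camera intrinsic matrix
    Returns an (H, W, 3) array of 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T      # viewing ray for each pixel
    return rays * depth[..., None]       # scale rays by depth -> 3D points

def sinusoidal_pe_3d(points, dim_per_axis=32):
    """Sinusoidal embedding of 3D coordinates (hypothetical 3D extension of
    the standard 2D positional encoding; exact frequencies are an assumption)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim_per_axis, 2) / dim_per_axis))
    embs = []
    for axis in range(3):
        x = points[..., axis:axis + 1] * freqs   # (H, W, dim_per_axis / 2)
        embs += [np.sin(x), np.cos(x)]
    return np.concatenate(embs, axis=-1)         # (H, W, 3 * dim_per_axis)
```

Under this sketch, two pixels observing nearby 3D points receive similar embeddings even across large viewpoint changes, which is the property that 2D pixel-coordinate embeddings lack.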
In this study, our main technical contributions are:
• We show that a 3D position encoder serving as the position embedding generator for the Transformer can greatly improve the quality of image correspondences determined by the neural network.
• We propose a depth feature generator that gives rough depth distributions for a single image, which helps the Transformer learn more space information and improves matching performance.
• We demonstrate that PA-LoFTR achieves state-of-the-art performance on an indoor dataset and multiple tasks.



Figure 1: Overview of the proposed method. Given a pair of images I^A, I^B, PA-LoFTR uses a CNN backbone to produce multi-level feature maps. The coarse-level feature maps F_V^A, F_V^B are augmented by three modules: 1. The Depth Feature Generator predicts a dense depth map for F_V^A, F_V^B and gives the corresponding depth features F_D^A, F_D^B. 2. The Position Embedding Generator prepares 3D position embeddings PE^A, PE^B given the camera intrinsics and extrinsics. 3. The Position-Aware Transformer Encoder combines the 3D position embeddings and depth features into the visual features F_V^A, F_V^B with self- and cross-attention layers.
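Module 3 of the figure can be illustrated with a minimal single-head attention layer in which the position embeddings and depth features are injected into the queries and keys at every layer. This is a simplified sketch under assumed interfaces (LoFTR itself uses multi-head linear attention, and the exact fusion of depth features in PA-LoFTR is an assumption):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pa_attention_layer(feat_q, feat_kv, pe_q, pe_kv, depth_q, depth_kv):
    """One position-aware attention layer (minimal single-head sketch).

    feat_q, feat_kv:   (N, D), (M, D) flattened visual features
    pe_q, pe_kv:       (N, D), (M, D) 3D position embeddings
    depth_q, depth_kv: (N, D), (M, D) depth features
    For self-attention, pass the same image's tensors on both sides; for
    cross-attention, pass the other image's tensors as the *_kv arguments.
    """
    D = feat_q.shape[1]
    # Inject 3D position embeddings and depth features at every layer,
    # so attention scores depend on 3D position as well as appearance.
    q = feat_q + pe_q + depth_q
    k = feat_kv + pe_kv + depth_kv
    attn = softmax(q @ k.T / np.sqrt(D))
    return feat_q + attn @ feat_kv       # residual update of the queries
```

Adding the embeddings at each encoder layer, rather than once at the input, keeps the position signal from being washed out over the stack; this mirrors the per-layer injection described in the text.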

2. RELATED WORK

Feature Matching Pipeline. The classic feature matching pipeline includes feature detection, feature description and feature matching (Lowe (2004); Bian et al. (2017); Sattler et al. (2009); Tuytelaars & Gool (2000)). Many methods follow this classic pipeline to handle the image correspondence problem. SIFT (Lowe, 2004) introduces a way to build a keypoint detector and

