PA-LOFTR: LOCAL FEATURE MATCHING WITH 3D POSITION-AWARE TRANSFORMER

Abstract

We propose a novel image feature matching method that exploits 3D position information to augment feature representations with a deep neural network. The proposed method introduces a 3D position embedding into a state-of-the-art feature matcher, LoFTR, and achieves more promising performance. Following the coarse-to-fine matching pipeline of LoFTR, we construct a Transformer-based neural network that generates dense pixel-wise matches. Instead of using 2D position embeddings for the Transformer, the proposed method generates 3D position embeddings that more precisely capture the positional correspondence of matches between images. Importantly, to guide the neural network to learn 3D spatial information, we augment features with depth information produced by a depth predictor. In this way, our method, PA-LoFTR, generates 3D position-aware local feature descriptors with a Transformer. Experiments on indoor datasets show that PA-LoFTR improves feature matching performance compared to state-of-the-art methods.

1. INTRODUCTION

Finding feature matches between images is an important task for many computer vision applications, including camera calibration, structure from motion (SfM), visual localization, simultaneous localization and mapping (SLAM), and stereo matching. Generally, the local feature matching problem is solved in three stages: feature detection, feature description, and feature matching. Most existing methods follow this pipeline in sequence. A feature detector narrows the focus to a set of interest points in the images. The feature description phase then generates a descriptor for each interest point. Finally, correspondences between the interest points of the images are found by a feature matching algorithm. While current methods extract features based on visual information from images, they pay little attention to the 3D position information of features, which limits matching performance. With the development of deep learning, a number of works have introduced deep architectures to image feature matching. Some methods develop deep networks for feature detection, while others focus on feature description or matching with a learning process. Recently, several works have developed detector-free deep architectures that generate dense matches (Rocco et al., 2018; Li et al., 2020; Rocco et al., 2020). Detector-free methods can provide pixel-wise matches, dropping the feature detector because each pixel is regarded as a potential interest point. Several detector-free methods achieve better performance on images with poor texture, complex patterns, challenging illumination, or large viewpoint changes, where a feature detector following the classic pipeline cannot determine enough interest points. Some methods, such as LoFTR (Sun et al., 2021) and COTR (Jiang et al., 2021), introduce Transformers into image feature matching, giving the deep architectures a global receptive field.
Taking advantage of the ability to relate two features anywhere in the images, these methods achieve state-of-the-art performance. However, detector-free methods that generate dense matches can still produce false correspondences under challenging pose changes or repetitive patterns. Although Transformer-based architectures can learn global relations between features, current methods still work with local visual features extracted directly from images, and most pay little attention to the 3D spatial information of feature points. Based on the observations above, we propose the 3D Position-Aware Local Feature Transformer (PA-LoFTR), a novel approach that encodes spatial information for local feature matching. We follow
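As a concrete illustration of the core idea, the following is a minimal NumPy sketch (not the authors' implementation) of how per-pixel 3D positions could be recovered from a predicted depth map and pinhole camera intrinsics, and then turned into a sinusoidal 3D position embedding analogous to the 2D encoding used in LoFTR. The function names, frequency schedule, and embedding dimension are illustrative assumptions.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map to per-pixel 3D points in the camera frame.

    depth: (H, W) predicted depth map; K: 3x3 pinhole intrinsic matrix.
    Returns an (H, W, 3) array of (x, y, z) coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def sinusoidal_embed_3d(points, dim_per_axis=8):
    """Sinusoidal embedding over (x, y, z), by analogy with the 2D
    positional encoding of LoFTR; frequencies here are an assumption.

    points: (H, W, 3) -> (H, W, 3 * dim_per_axis) embedding.
    """
    freqs = 2.0 ** np.arange(dim_per_axis // 2)   # (F,) geometric frequencies
    angles = points[..., None] * freqs            # (H, W, 3, F)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*points.shape[:-1], -1)
```

In a full pipeline, such an embedding would be added to (or concatenated with) the coarse feature map before the Transformer layers, so that attention can exploit 3D positional correspondence rather than 2D image coordinates alone.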