KEYPOINT MATCHING VIA RANDOM NETWORK CONSENSUS

Abstract

Visual description, detection, and matching of keypoints in images are fundamental components of many computer vision problems, such as camera tracking and (re)localization. Recently, learning-based feature extractors built on convolutional neural networks (CNNs) have achieved state-of-the-art performance. In this paper, we further explore the usage of CNNs and show that it is possible to leverage randomly initialized CNNs without training. Our observation is that the CNN architecture inherently extracts features with a certain degree of robustness to viewpoint/illumination changes and can thus be regarded as a descriptor extractor. Consequently, randomized CNNs serve as descriptor extractors, and a subsequent consensus mechanism detects keypoints from them. Such a description-and-detection pipeline can be used to match keypoints across images and achieves higher generalization ability than state-of-the-art methods in our experiments.

1. INTRODUCTION

Keypoint detection, description, and matching in images are fundamental building blocks in many computer vision tasks, such as visual localization (Sattler et al., 2016; Taira et al., 2018; Dusmanu et al., 2019; Revaud et al., 2019; Sarlin et al., 2019; 2020; Tang et al., 2021), Structure-from-Motion (SfM) (Snavely et al., 2006; Wu, 2013; Cui & Tan, 2015; Schönberger & Frahm, 2016; Lindenberger et al., 2021), Simultaneous Localization and Mapping (SLAM) (Mur-Artal et al., 2015; Mur-Artal & Tardós, 2017; Dai et al., 2017), object detection (Csurka et al., 2004; Yang et al., 2019), and pose estimation (Suwajanakorn et al., 2018; Kundu et al., 2018). Keypoints, in general, refer to salient pixels that are matched across images to form point-to-point correspondences. To be matched accurately, they should be discriminative and robust to viewpoint/illumination changes. Traditional approaches follow a detection-then-description pipeline that first detects salient pixels (Harris et al., 1988; Lowe, 2004; Mikolajczyk & Schmid, 2004) and then computes local descriptors (Lowe, 1999; Bay et al., 2006; Calonder et al., 2011; Rublee et al., 2011) at those pixels. Typically, the detectors rely on low-level 2D geometric structures such as corners and blobs. To cope with large scale and viewpoint changes, scale-space pyramids are built with the Laplacian of Gaussian (LoG), the Difference of Gaussians (DoG), etc. For description, local statistics such as gradients and histograms are computed and used as visual descriptors. Among these, SIFT (Lowe, 1999) and its variant RootSIFT (Arandjelović & Zisserman, 2012) remain popular today due to their generality.
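To make the detection-then-description idea above concrete, the following is a minimal, illustrative sketch of a DoG-style detector: the image is blurred at increasing scales, adjacent pyramid levels are subtracted, and strong local extrema of the response are kept as keypoints. This is a toy version for intuition only (the function name, scale values, and threshold are our own illustrative choices, not SIFT's actual parameters).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.03):
    """Toy Difference-of-Gaussians detector: blur the image at increasing
    scales, subtract adjacent levels, and keep strong local extrema."""
    blurred = [gaussian_filter(image.astype(np.float32), s) for s in sigmas]
    keypoints = []
    for level, (b1, b2) in enumerate(zip(blurred, blurred[1:])):
        dog = b2 - b1
        # a pixel is salient if it is an extremum of its 3x3 neighbourhood
        is_ext = (dog == maximum_filter(dog, size=3)) | \
                 (dog == minimum_filter(dog, size=3))
        ys, xs = np.nonzero(is_ext & (np.abs(dog) > thresh))
        keypoints += [(x, y, sigmas[level]) for x, y in zip(xs, ys)]
    return keypoints

# a synthetic square blob should yield a keypoint near its centre (31.5, 31.5)
img = np.zeros((64, 64), np.float32)
img[28:36, 28:36] = 1.0
kps = dog_keypoints(img)
```

A full SIFT implementation additionally interpolates extrema to sub-pixel/sub-scale accuracy, rejects edge responses, and attaches an orientation before computing the gradient-histogram descriptor.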
In recent years, learning-based approaches built on convolutional neural networks (CNNs) (Yi et al., 2016; Noh et al., 2017; Ono et al., 2018; Mishkin et al., 2018; DeTone et al., 2018; Dusmanu et al., 2019; Revaud et al., 2019) have achieved promising results, especially under extreme appearance changes such as day-night transitions (Zhou et al., 2016) and seasonal variations (Sattler et al., 2018). Compared with traditional handcrafted approaches, the key advantage of introducing deep learning is the ability to learn robust keypoint representations from large-scale datasets. The aforementioned methods apply either supervised or self-supervised learning to train their networks. After training, the off-the-shelf detectors and descriptors generalize well to several new datasets. Benefiting from their simplicity and effectiveness, learned features such as SuperPoint (DeTone et al., 2018), D2-Net (Dusmanu et al., 2019), and R2D2 (Revaud et al., 2019) are widely used nowadays. In this paper, we further explore CNNs for keypoint detection, description, and matching, without requiring the deep networks to be trained. Our observation is that the CNN architecture itself inher-
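The core observation, that even an untrained convolutional layer yields per-pixel features usable as descriptors, can be illustrated with a minimal sketch. This is our own toy example, not the paper's actual pipeline: a fixed bank of random filters followed by ReLU produces a channel vector at each pixel, which after L2-normalization can be matched by cosine similarity. The paper's consensus-based keypoint detection is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# a single bank of fixed random 5x5 filters stands in for an untrained CNN layer
FILTERS = rng.standard_normal((16, 5, 5)).astype(np.float32)

def random_descriptors(image, filters=FILTERS):
    """Per-pixel descriptors from a randomly initialised convolution + ReLU,
    L2-normalised so that matching reduces to cosine similarity."""
    k = filters.shape[-1]
    pad = k // 2
    padded = np.pad(image.astype(np.float32), pad, mode="reflect")
    h, w = image.shape
    flat = filters.reshape(len(filters), -1)
    feats = np.zeros((h, w, len(filters)), np.float32)
    for i in range(h):
        for j in range(w):
            feats[i, j] = np.maximum(flat @ padded[i:i + k, j:j + k].ravel(), 0.0)
    return feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)

# corresponding pixels in a shifted copy of a random texture should match well
img = rng.standard_normal((20, 20)).astype(np.float32)
shifted = np.roll(img, 3, axis=1)          # translate 3 pixels to the right
f1, f2 = random_descriptors(img), random_descriptors(shifted)
sim_true = float(f1[10, 10] @ f2[10, 13])  # true correspondence
sim_false = float(f1[10, 10] @ f2[5, 5])   # unrelated pixel
```

Because the filters are identical across images, corresponding patches map to (near-)identical descriptors, so `sim_true` approaches 1 while unrelated pixels score lower; robustness to viewpoint/illumination changes in a deeper randomized network comes from pooling and depth rather than this single layer.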

