A NEW PARADIGM FOR CROSS-MODALITY PERSON RE-IDENTIFICATION

Abstract

Visible-infrared person re-identification (ReID) remains very challenging on account of scarce cross-modality datasets and large inter-modality variation. Most existing cross-modality ReID methods struggle to eliminate the cross-modality discrepancy caused by heterogeneous images. In this paper, we present an effective framework and build a large benchmark, named NPU-ReID. To this end, we propose a dual-path fusion network that takes the transformer as the smallest feature extraction unit. To expand cross-modality sample diversity, we propose a modality augmentation strategy that generates semi-modality pedestrian images by exchanging certain patches between modalities; the main innovation is that the cross-modality gap can be indirectly minimized by reducing the variance between the semi-modality images and the infrared or visible modality. Moreover, to make the traditional triplet loss more suitable for cross-modality matching, we design a multi-masking triplet loss that optimizes the relative distances between anchors and cross-modality positive/negative sample pairs, in particular constraining the distance between simple and hard positive samples. Experimental results demonstrate that our proposed method achieves superior performance compared with other methods on SYSU-MM01, RegDB, and our proposed NPU-ReID dataset, with a significant improvement of 6.81% in rank-1 and 9.65% in mAP on RegDB.
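The modality augmentation strategy outlined above, generating a semi-modality image by exchanging patches between a paired visible and infrared image of the same pedestrian, could be sketched as follows. This is a minimal illustration, not the authors' implementation; the patch size, patch count, and function name are assumptions.

```python
import numpy as np

def semi_modality_augment(vis_img, ir_img, patch_size=32, num_patches=4, rng=None):
    """Generate a semi-modality image by pasting random infrared patches
    into a paired visible image of the same pedestrian (illustrative sketch;
    patch_size and num_patches are assumed hyperparameters)."""
    assert vis_img.shape == ir_img.shape, "paired images must be aligned"
    rng = rng if rng is not None else np.random.default_rng()
    h, w = vis_img.shape[:2]
    semi = vis_img.copy()
    for _ in range(num_patches):
        # pick a random top-left corner for the patch to exchange
        y = rng.integers(0, h - patch_size + 1)
        x = rng.integers(0, w - patch_size + 1)
        # replace the visible patch with the corresponding infrared patch
        semi[y:y + patch_size, x:x + patch_size] = ir_img[y:y + patch_size,
                                                          x:x + patch_size]
    return semi
```

During training, the resulting semi-modality images would sit between the two original modalities, so aligning them with either modality indirectly narrows the visible-infrared gap.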

1. INTRODUCTION

Person re-identification (ReID) is a challenging task in computer vision, with wide applications in autonomous driving, intelligent video surveillance, and human-computer interaction systems Ye et al. (2021); Zheng et al. (2019); Miao et al. (2019). Person ReID aims to retrieve a target pedestrian across multiple non-overlapping surveillance cameras or from different video clips. At present, most research performed on single-modality visible images captured in daytime has achieved good performance, such as TransReID He et al. (2021), AGW Ye et al. (2021), MMT Ge et al. (2020), HOReID Wang et al. (2020), PAT Li et al. (2021), and ISP Zhu et al. (2020). However, in night-time surveillance and low-light environments, visible cameras fail to capture person images with rich appearance information. This lighting limitation means that single-modality ReID frameworks cannot satisfy all-weather practical application scenarios.

With cameras that can be switched to infrared mode widely deployed in intelligent surveillance systems, cross-modality infrared-visible ReID has become a key but challenging technology. Visible images and infrared images are heterogeneous image pairs with very different visual features. Intuitively, pedestrians in visible images have clearer texture features and more valid appearance information than in infrared images under good illumination, whereas infrared images can provide more distinct pedestrian silhouettes and complete contour information. Naturally, robust feature representations can be generated by sufficiently incorporating this cross-modality complementary information. However, single-modality person ReID methods are difficult to apply directly to cross-modality tasks because of large inter-modality variations: the difference between images of the same identity across modalities can be even greater than that between images of different identities within the same modality. The large modality gap between visible and infrared images, together with unknown environmental factors, gives rise to a vitally challenging cross-modality problem. As shown in Figure 1, the same identity within the same modality suffers from large intra-modality variations arising from different human poses as well as diverse camera viewpoints. Meanwhile, the

