A NEW PARADIGM FOR CROSS-MODALITY PERSON RE-IDENTIFICATION

Abstract

Visible-infrared person re-identification (ReID) remains very challenging owing to the scarcity of cross-modality datasets and the large inter-modality variation. Most existing cross-modality ReID methods struggle to eliminate the cross-modality discrepancy arising from heterogeneous images. In this paper, we present an effective framework and build a large benchmark, named NPU-ReID. To this end, we propose a dual-path fusion network that takes the transformer as the smallest feature extraction unit. To expand cross-modality sample diversity, we propose a modality augmentation strategy that generates semi-modality pedestrian images by exchanging certain patches; the main innovation is that the cross-modality gap can be indirectly minimized by reducing the variance between the semi-modality and the infrared or visible modality. Moreover, to make the traditional triplet loss more suitable for cross-modality matching, we design a multi-masking triplet loss that optimizes the relative distances between anchor and cross-modality positive/negative sample pairs, in particular constraining the distance between easy and hard positive samples. Experimental results demonstrate that our proposed method achieves superior performance to other methods on SYSU-MM01, RegDB, and our proposed NPU-ReID dataset, with a particularly significant improvement on RegDB of 6.81% in Rank-1 and 9.65% in mAP.

1. INTRODUCTION

Person re-identification (ReID) is a challenging task in computer vision that is widely used in autonomous driving, intelligent video surveillance, and human-computer interaction systems Ye et al. As cameras that can switch to infrared mode are widely deployed in intelligent surveillance systems, cross-modality infrared-visible ReID has become a key but challenging technology. Visible and infrared images are heterogeneous image pairs with very different visual characteristics. Intuitively, under good illumination pedestrians in visible images have clearer texture features and richer appearance information than in infrared images, while infrared images provide more distinct pedestrian silhouettes and more complete contour information. Naturally, a robust feature representation can be generated by sufficiently incorporating this cross-modality complementary information. However, single-modality person ReID methods are difficult to apply directly to cross-modality tasks because of large inter-modality variations: the difference between images of the same identity across modalities can be even greater than that between images of different identities within the same modality. The large modality gap between visible and infrared images, together with unknown environmental factors, gives rise to a vitally challenging cross-modality problem. As shown in Figure 1, the same identity within one modality suffers from large intra-modality variations caused by different human poses and diverse camera viewpoints. Meanwhile, the heterogeneous imaging processes of different spectrum cameras result in large cross-modality variations. These variations may lead to a larger intra-identity difference than inter-identity difference and thus cause wrong matching results. Therefore, a solution that reduces both the cross-modality discrepancy and the intra-modality variations is needed.
Researchers have proposed many methods to address the aforementioned challenges in cross-modality ReID. Several methods map person images from different modalities into a common feature space to minimize the modality gap Ye et al. (2018a; b; 2020). To alleviate the color discrepancy, generative adversarial networks (GANs) are used in many works to synthesize fake RGB/IR images while preserving the identity information as much as possible Wang et al. (2019; 2020); Wang et al. (2019); Zhang et al. (2019). However, the challenge of appearance variations, including background clutter and viewpoint changes, still remains. Furthermore, these methods continue to use the triplet loss or ranking loss of single-modality metric learning to supervise the network in mining identity-related cues, rather than designing a modality-related loss function to learn discriminative features in the cross-modality setting.

The quality of the dataset directly affects the representation ability of the embedding features, which to some extent determines the accuracy and efficiency of identification. Consequently, we build a cross-modality dataset called NPU-ReID, which makes up for the deficiencies of small scale and uneven modality distribution. We collect images with a multi-view camera system consisting of 4 visible cameras and 4 infrared cameras, which ensures that each identity has several infrared and visible images under each camera. To tackle the concurrent challenges of intra- and cross-modality variations, we present a novel modality augmentation to eliminate the modality discrepancy. The straightforward operation is to generate semi-modality images by exchanging certain regions with patches from images of the same identity in the other modality, which deepens the information exchange between infrared and visible images. Simply put, the augmented image contains information from both modalities, which effectively reduces the difficulty of cross-modality matching.
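The patch-exchange operation described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact procedure: the function name `modality_augment` and the parameters `patch` and `swap_ratio` are assumptions, and the paper does not specify how many patches are exchanged or how they are selected.

```python
import numpy as np

def modality_augment(visible, infrared, patch=32, swap_ratio=0.25, rng=None):
    """Generate a semi-modality image by copying a random subset of
    patches from the infrared image into the visible image.

    visible, infrared: (H, W, C) arrays of the same identity and size.
    patch: side length of the square patches exchanged (assumed value).
    swap_ratio: fraction of patches replaced with the other modality.
    """
    if rng is None:
        rng = np.random.default_rng()
    semi = visible.copy()
    H, W = visible.shape[:2]
    rows, cols = H // patch, W // patch
    n_swap = max(1, int(rows * cols * swap_ratio))
    # Choose distinct patch positions and overwrite them with the
    # corresponding region from the infrared image.
    idx = rng.choice(rows * cols, size=n_swap, replace=False)
    for k in idx:
        r, c = divmod(k, cols)
        ys, xs = r * patch, c * patch
        semi[ys:ys + patch, xs:xs + patch] = infrared[ys:ys + patch, xs:xs + patch]
    return semi
```

Because the exchanged regions come from an image of the same identity, the augmented sample keeps its identity label while mixing the statistics of both modalities.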
In addition, ReID networks are usually trained with a cross-entropy loss and a triplet loss to enlarge the inter-category discrepancy and reduce the intra-category variation. We propose a multi-masking triplet loss that balances the advantages and disadvantages of the traditional triplet loss and the triplet loss with hard sample mining, and we design a cross-modality positive-sample distance compression function to reduce the intra-category difference.



(2021); Zheng et al. (2019); Miao et al. (2019). Person ReID aims to search for a target pedestrian across multiple non-overlapping surveillance cameras or different video clips. At present, most research on single-modality visible images captured in daytime has achieved good performance, e.g., TransReID He et al. (2021), AGW Ye et al. (2021), MMT Ge et al. (2020), HOReID Wang et al. (2020), PAT Li et al. (2021), and ISP Zhu et al. (2020). However, in night-time surveillance and low-light environments, visible cameras fail to capture person images with rich appearance information. This lighting limitation means that single-modality ReID frameworks cannot satisfy all-weather practical application scenarios.

Figure 1: Illustration of person re-identification. When the left visible image is used as a query, the upper image list shows the ranked results in the single-modality setting and the lower image list shows the ranked results in the cross-modality setting. The 1st to 3rd columns present true positive samples, i.e., these gallery images and the query image belong to the same person. The last column presents false positive samples.

