SIAMCAN: SIMPLE YET EFFECTIVE METHOD TO ENHANCE SIAMESE SHORT-TERM TRACKING

Abstract

Most traditional Siamese trackers take the location of the maximum of the response map as the target center. However, these methods struggle to compute accurate response values when faced with similar objects, deformation, background clutter, and other challenges. Obtaining a reliable response map is therefore the key to improving tracking performance. Accordingly, this paper proposes a simple yet effective short-term tracking framework, SiamCAN, which addresses this problem by bridging the information flow between the search branch and the template branch. Moreover, to obtain more accurate target estimation, an anchor-free mechanism and a specialized training strategy are applied to narrow the gap between the predicted bounding box and the ground truth. The proposed method achieves state-of-the-art performance on four visual tracking benchmarks, UAV123, OTB100, VOT2018, and VOT2019, outperforming the strong baseline SiamBAN: 0.327 → 0.331 on VOT2019, and 0.631 → 0.638 in success score and 0.833 → 0.850 in precision score on UAV123.

1. INTRODUCTION

Visual object tracking is a fundamental task in computer vision that aims to track an unknown object whose information is given only in the first frame. Although great progress has been achieved in recent years, a robust tracker is still in high demand because of challenges such as scale variation, appearance deformation, and similar objects in complex backgrounds, all of which can degrade tracking performance (Wu et al. (2013); Zhang et al. (2014)). Recently, Siamese network based trackers have taken a vital place in the single object tracking (SOT) field because of their accuracy and speed. Since Tao et al. (2016) and Bertinetto et al. (2016) introduced Siamese networks to visual tracking, the Siamese structure has been adopted as a baseline for designing efficient trackers (Li et al. (2018); Zhu et al. (2018a); Zhang & Peng (2019); Li et al. (2019); Xu et al. (2020); Chen et al. (2020)). After SiamRPN (Li et al. (2018)) was proposed to obtain more accurate anchor boxes, the region proposal network became an essential part of trackers. However, the anchor scales are set manually, which conflicts with the fact that the tracking target is unknown. Besides, the performance of Siamese based trackers depends greatly on offline training with massive numbers of frame pairs. As a result, when the target's category is absent from the training dataset, the learned features are not discriminative enough, which greatly increases the risk of tracking drift under significant deformation, similar-object distractors, or complex backgrounds. In recent years, the attention mechanism has come into the spotlight in computer vision and has inspired related work not only in detection but also in visual tracking (He et al. (2018); Abdelpakey et al. (2018); Wang et al. (2018); Zhu et al. (2018b)). The attention mechanism includes channel attention and spatial attention: the former generates a set of channel weights that model interdependencies between channels, while the latter focuses on finding the informative regions by exploiting the inter-spatial relationships of features. Given these benefits, Siamese based trackers have tried to introduce attention modules to distinguish the target from complex backgrounds. Nevertheless, the performance of these trackers is unsatisfactory because they do not exploit the expressive power of the attention mechanism appropriately.
Based on the limitations discussed above, we design a simple Cross-attention Guided Siamese network (SiamCAN) based tracker with an anchor-free strategy, which performs better than state-of-the-art trackers when facing the similar-object challenge. SiamCAN takes template channel attention

