SIAMCAN: SIMPLE YET EFFECTIVE METHOD TO ENHANCE SIAMESE SHORT-TERM TRACKING

Abstract

Most traditional Siamese trackers regard the location of the maximum of the response map as the center of the target. However, it is difficult for these methods to compute response values accurately when facing similar objects, deformation, background clutter and other challenges. Obtaining a reliable response map is therefore the key to improving tracking performance. Accordingly, this paper proposes a simple yet effective short-term tracking framework, called SiamCAN, which bridges the information flow between the search branch and the template branch. Moreover, to obtain a more accurate target estimate, an anchor-free mechanism and a specialized training strategy are applied to narrow the gap between the predicted bounding box and the ground truth. The proposed method achieves state-of-the-art performance on four visual tracking benchmarks, UAV123, OTB100, VOT2018 and VOT2019, outperforming the strong baseline SiamBAN: 0.327 → 0.331 on VOT2019, and 0.631 → 0.638 success score and 0.833 → 0.850 precision score on UAV123.

1. INTRODUCTION

Visual object tracking is a fundamental task in computer vision that aims to track an unknown object whose information is given only in the first frame. Although great progress has been achieved in recent years, a robust tracker is still in high demand because challenges such as scale variation, appearance deformation and similar objects against complex backgrounds can deteriorate tracking performance (Wu et al. (2013); Zhang et al. (2014)). Recently, Siamese network based trackers have taken a vital place in the SOT field due to their accuracy and speed. Since (Tao et al. (2016)) and (Bertinetto et al. (2016)) introduced Siamese networks in visual tracking, the Siamese structure has been adopted as a baseline by researchers designing efficient trackers (Li et al. (2018); Zhu et al. (2018a); Zhang & Peng (2019); Li et al. (2019); Xu et al. (2020); Chen et al. (2020)). After SiamRPN (Li et al. (2018)) was proposed to obtain more accurate anchor boxes, the region proposal network became an essential part of the tracker. However, the anchor scales are set manually, which conflicts with the fact that the tracking target is unknown. Besides, the performance of Siamese based trackers depends greatly on offline training with massive numbers of frame pairs. The risk of tracking drift therefore rises sharply under significant deformation, similar object distractors or complex backgrounds, because the features learned for the target are not discriminative when its category is absent from the training dataset.

In recent years, the attention mechanism has become a focus of computer vision, inspiring related work not only in detection but also in visual tracking (He et al. (2018); Abdelpakey et al. (2018); Wang et al. (2018); Zhu et al. (2018b)). The attention mechanism includes channel attention and spatial attention: the former generates a set of channel weights that model interdependencies between channels, while the latter finds the informative parts of a feature map by exploiting the inter-spatial relationships of features. Given these benefits, Siamese based trackers have introduced attention modules to distinguish the target from complex backgrounds. Nevertheless, the performance of these trackers is unsatisfactory because they do not exploit the expressive power of the attention mechanism appropriately.

Based on the limitations discussed above, we design a simple tracker built on a Cross-attention Guided Siamese network (SiamCAN) with an anchor-free strategy, which performs better than state-of-the-art trackers when facing the similar-object challenge. SiamCAN uses template channel attention to guide feature extraction from the search image, which strengthens the tracker's ability to overcome distractors and complex backgrounds, so that it performs better than most Siamese-based trackers, as shown in Figure 1. The main contributions of this work are:

• We formulate a cross-attention guided Siamese framework (SiamCAN) including cross-channel attention and self-spatial attention. The cross-channel attention builds an interactive bridge between the target template and the search frame so that they share identical channel weights. The self-spatial attention focuses on the discriminative part of the correlated feature map, which is complementary to the cross-channel attention.
• The proposed tracker performs adaptive box regression without numerous hyper-parameter settings. To obtain more accurate bounding boxes, we adopt a suitable strategy for exploiting the merits of the anchor-free design during training.
• SiamCAN achieves state-of-the-art results on four large tracking benchmarks, including OTB100 (Wu et al. (2013)), UAV123 (Mueller et al. (2016)), VOT2018 (Kristan et al. (2018)) and VOT2019 (Kristan et al. (2019)). The tracker also runs at 35 FPS.

Figure 1: Tracking results on three challenges: scale variation (book), similar distractors (crabs) and fast motion (dinosaur). Compared with two other state-of-the-art trackers, our tracker (SiamCAN) performs better.

2. RELATED WORK

In this section, we briefly review recent Siamese based trackers, anchor-free approaches and attention mechanisms in both the tracking and detection fields.

SIAMESE NETWORK BASED TRACKER

The pioneering works SINT (Tao et al. (2016)) and SiamFC (Bertinetto et al. (2016)) first introduced the Siamese network to the tracking field. Thanks to its fast speed and light structure, the Siamese network has drawn great attention from the visual tracking community. SiamFC uses a Siamese network to learn features of both the target template and the search frame, and compares their similarity to find the most confident candidates. Although it tracks fast, applying feature maps at several scales does not let it handle the scale variation problem. Inspired by Faster-RCNN (Ren et al. (2015)) from object detection, SiamRPN (Li et al. (2018)) draws on the region proposal network to obtain bounding boxes with more varied scales and ratios. Since then, the RPN module has become an essential part of the tracker (Zhu et al. (2018a); Zhang & Peng (2019); Li et al. (2019); Dong & Shen (2018); Fan & Ling (2019)). However, the complexity of anchor design makes the performance of trackers depend greatly on the effect of anchor training.
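As a rough illustration of the similarity matching described above, the following sketch cross-correlates a template with a search region and takes the peak of the response map as the target position. It is a single-channel, pure-Python toy under stated assumptions, not the SiamFC implementation: real trackers correlate multi-channel CNN features, and the names `cross_correlate` and `peak_location` are hypothetical.

```python
# Toy SiamFC-style matching (assumption: single-channel features stored as
# 2-D lists of floats; a real tracker would use deep multi-channel features).

def cross_correlate(search, template):
    """Slide `template` over `search` (valid mode) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the window at (i, j).
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def peak_location(response):
    """Return (row, col) of the maximum response, taken as the target position."""
    best = max(
        (value, i, j)
        for i, row in enumerate(response)
        for j, value in enumerate(row)
    )
    return best[1], best[2]


# Hypothetical data: a diagonal pattern embedded in an otherwise empty search region.
template = [[1.0, 0.0],
            [0.0, 1.0]]
search = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0, 0.0]]

response = cross_correlate(search, template)   # 3 x 3 response map
row, col = peak_location(response)             # peak at row 1, col 2
```

A distractor that resembles the template would produce a second high peak in the same response map, which is why the reliability of the response values matters for this family of trackers.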

