SIAMCAN: A SIMPLE YET EFFECTIVE METHOD TO ENHANCE SIAMESE SHORT-TERM TRACKING

Abstract

Most traditional Siamese trackers regard the location of the maximum of the response map as the center of the target. However, it is difficult for these methods to compute the response accurately when facing similar objects, deformation, background clutter and other challenges, so obtaining a reliable response map is key to improving tracking performance. Accordingly, this paper proposes a simple yet effective short-term tracking framework (called SiamCAN), which bridges the information flow between the search branch and the template branch, to address this problem. Moreover, to obtain more accurate target estimation, an anchor-free mechanism and a specialized training strategy are applied to narrow the gap between the predicted bounding box and the ground truth. The proposed method achieves state-of-the-art performance on four visual tracking benchmarks, including UAV123, OTB100, VOT2018 and VOT2019, outperforming the strong baseline SiamBAN by 0.327 → 0.331 EAO on VOT2019, and by 0.631 → 0.638 success score and 0.833 → 0.850 precision score on UAV123.

1. INTRODUCTION

Visual object tracking is a fundamental task of computer vision, aiming at tracking an unknown object whose information is given only in the first frame. Although great progress has been achieved in recent years, robust trackers are still in demand due to tricky challenges such as scale variation, appearance deformation and similar objects in complex backgrounds, all of which can deteriorate tracking performance (Wu et al. (2013); Zhang et al. (2014)). Recently, Siamese network based trackers have taken a vital place in the single object tracking (SOT) field due to their accuracy and speed. Since (Tao et al. (2016)) and (Bertinetto et al. (2016)) introduced Siamese networks to visual tracking, the Siamese structure has been adopted as a baseline for researchers to design efficient trackers (Li et al. (2018); Zhu et al. (2018a); Zhang & Peng (2019); Li et al. (2019); Xu et al. (2020); Chen et al. (2020)). After SiamRPN (Li et al. (2018)) was proposed to obtain more accurate anchor boxes, the region proposal network became an essential part of many trackers. However, the anchor scales are manually set, which conflicts with the fact that the tracking target is unknown. Besides, the performance of Siamese based trackers depends greatly on offline training with massive frame pairs. This highly increases the risk of tracking drift when facing significant deformation, similar object distractors, or complex background, because the features learned for the target are not discriminative when the category of the target is excluded from the training dataset. In recent years, the attention mechanism has become the spotlight of computer vision, inspiring related work not only in detection but also in visual tracking (He et al. (2018); Abdelpakey et al. (2018); Wang et al. (2018); Zhu et al. (2018b)).
The attention mechanism includes channel attention and spatial attention: the former generates a set of channel weights to model interdependencies between channels, while the latter focuses on finding the informative parts by exploiting the inter-spatial relationships of features. Considering these benefits, Siamese based trackers have tried to introduce attention modules to distinguish the target from complex backgrounds. Nevertheless, the performance of these trackers is not satisfactory because the expressive power of the attention mechanism is exploited inappropriately. Based on the limitations discussed above, we design a simple Cross-attention Guided Siamese Network (SiamCAN) based tracker with an anchor-free strategy, which performs better than state-of-the-art trackers when facing the similar object challenge. SiamCAN uses the template channel attention to guide the feature learning of the search branch. The main contributions of this work are as follows:

• We formulate a cross-attention guided Siamese framework (SiamCAN) including cross-channel attention and self-spatial attention. The cross-channel attention builds an interactive bridge between the target template and the search frame to share identical channel weights. The self-spatial attention focuses on the discriminative part of the correlated feature map, which is complementary to the cross-channel attention.

• The proposed tracker performs adaptive box regression without numerous hyper-parameter settings. To obtain more accurate bounding boxes, we adopt a proper training strategy that exploits the merits of the anchor-free design.

• SiamCAN achieves state-of-the-art results on four large tracking benchmarks, including OTB100 (Wu et al. (2013)), UAV123 (Mueller et al. (2016)), VOT2018 (Kristan et al. (2018)) and VOT2019 (Kristan et al. (2019)), while running at 35 FPS.

2. RELATED WORK

In this section, we briefly review recent Siamese based trackers, anchor-free approaches, and attention mechanisms in both the tracking and detection fields.

2.1. SIAMESE NETWORK BASED TRACKER

The pioneering works, SINT (Tao et al. (2016)) and SiamFC (Bertinetto et al. (2016)), introduced the Siamese structure to tracking and were followed by a series of anchor-based Siamese trackers. However, the complexity of anchor design makes the performance of these trackers depend greatly on the effect of anchor training.

2.2. ANCHOR-FREE APPROACHES

Recently, anchor-free approaches have developed rapidly. They can be divided into two categories: the first (Kong et al. (2019); Law & Deng (2018)) estimates the keypoints of objects, while the other (Redmon et al. (2016); Tian et al. (2019)) predicts a bounding box for each pixel, which avoids presetting the scale ratios of anchors. Anchor-free approaches are not only popular in the detection field but also suitable for target estimation in tracking due to their high efficiency. SiamFC++ follows FCOS (Tian et al. (2019)) to design its regression subnetwork and adds a centerness branch to eliminate low-quality samples. SiamCAR (Guo et al. (2020)) additionally changes the basic network structure, merging multi-layer features before correlation. Different from SiamCAR, SiamBAN (Chen et al. (2020)) puts emphasis on label assignment, which improves tracking performance. Our method differs from the above trackers in details (Section 4.3).

3. PROPOSED METHOD

As shown in Figure 2, the proposed framework mainly consists of three components: the feature extraction Siamese network, the cross-attention module, and the anchor-free bounding box regression subnetwork with the foreground classification subnetwork.

3.1. FEATURE EXTRACTED SIAMESE NETWORK

Like most Siamese based trackers, we adopt a fully convolutional network without padding, which guarantees accurate location calculation. The feature extraction network consists of two parts, the template branch and the search branch. Both share the same backbone parameters; in this way, the CNN learns features suitable for computing the similarity between the two branches in subsequent operations. The template branch encodes the exemplar feature from the first frame, while the search branch encodes the candidate features, which may contain the target, in the follow-up frames. Let the input to the template branch be $I_t$ and the subsequent frames fed to the search branch be $I_s$. We feed $I_t$ and $I_s$ into the backbone and obtain the features $\phi_l(I_t)$ and $\phi_l(I_s)$ from different backbone layers $l$. The given features are then passed through a neck layer (a convolution reducing the channel size to 256) to produce the template feature $\psi_t(I_t)$ and the search feature $\psi_s(I_s)$. Finally, a 7×7 patch is cropped from the center of the template feature.
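The center crop of the template feature described above can be sketched as follows; this is a minimal plain-Python illustration, and the function name is ours, not from the paper's code.

```python
def center_crop(feat, size=7):
    """Crop a size x size patch from the center of a feature map.

    feat: list of C channels, each a 2D list of shape H x W.
    Returns the cropped feature with the same number of channels.
    """
    h = len(feat[0])
    w = len(feat[0][0])
    top = (h - size) // 2
    left = (w - size) // 2
    # Slice every channel to the central size x size window
    return [[row[left:left + size] for row in ch[top:top + size]] for ch in feat]
```

For a 15×15 template feature, this keeps rows and columns 4 through 10, mirroring the 7×7 crop used after the neck layer.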

3.2. CROSS-ATTENTION NETWORK

The attention mechanism is designed to force CNNs to focus on the parts of greatest importance, i.e., channel information and spatial information. Channel attention explores the interdependencies between channels, while spatial attention makes CNNs pay more attention to the most critical areas of the feature. Different from (He et al. (2018); Wang et al. (2018)), our channel attention is used between the two branches rather than being applied as self-attention. Moreover, SATIN (Gao et al. (2020)) designs a module also called cross-attention, but there 'cross' means the combination of different layers, which is different from our method. In this paper, the template branch feature $\psi_t(I_t)$ is sent to global average pooling to obtain the aggregated feature $Y_t$, i.e.,

$$Y_t = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H}\psi_t(I_t)_{i,j} \qquad (1)$$

Given the aggregated feature, the channel weights are obtained by performing a 1D convolution of size $k$, i.e.,

$$V_i = \sigma\Big(\sum_{j=1}^{k}\omega^j y_i^j\Big), \quad y_i^j \in \Omega_i^k \qquad (2)$$

where $\sigma$ is the sigmoid function, $\omega$ indicates the parameters of the 1D convolution and $\Omega_i^k$ indicates the set of $k$ adjacent channels of $y_i$. To let the search branch learn information from the target template, we multiply the channel weights with the search feature, i.e.,

$$\psi_s(I_s) = \psi_s(I_s) * V \qquad (3)$$
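Eqs. (1)-(3) can be sketched in plain Python as follows. This is a minimal illustration under our own naming, not the paper's released code; the fixed averaging weights stand in for the learned 1D-convolution parameters $\omega$.

```python
import math

def cross_channel_attention(template_feat, search_feat, k=3, weights=None):
    """Template-guided channel attention (Eqs. 1-3).

    template_feat, search_feat: lists of C channels, each a 2D list (H x W).
    Returns the reweighted search feature and the channel weights V.
    """
    c = len(template_feat)
    if weights is None:
        weights = [1.0 / k] * k  # placeholder for learned 1D-conv weights
    # Eq. (1): global average pooling of the template feature per channel
    y = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in template_feat]
    # Eq. (2): 1D convolution of size k over neighbouring channels + sigmoid
    pad = k // 2
    v = []
    for i in range(c):
        s = 0.0
        for j in range(k):
            idx = i - pad + j
            if 0 <= idx < c:
                s += weights[j] * y[idx]
        v.append(1.0 / (1.0 + math.exp(-s)))
    # Eq. (3): reweight the *search* feature with template-derived weights
    out = [[[v[ch] * x for x in row] for row in search_feat[ch]]
           for ch in range(c)]
    return out, v
```

Note the asymmetry that defines the "cross" attention: the weights are computed from the template branch but applied to the search branch.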

3.3. CLASSIFICATION AND ANCHOR-FREE REGRESSION SUBNETWORK

As shown in Figure 2, the correlation feature maps are computed by the depth-wise correlation operation between $\psi_s(I_s)$ and $\psi_t(I_t)$, i.e.,

$$F^{cls}_{w\times h\times c} = \psi_s(I_s) \star \psi_t(I_t) \qquad (4)$$
$$F^{reg}_{w\times h\times c} = \psi_s(I_s) \star \psi_t(I_t) \qquad (5)$$

where $\star$ denotes the depth-wise convolution operation. Then, we apply self-spatial attention to the feature maps in order to focus on the discriminative parts automatically, i.e.,

$$F^{cls}_{w\times h\times c} = \sigma(f([\mathrm{AvgP}(F^{cls}_{w\times h\times c}); \mathrm{MaxP}(F^{cls}_{w\times h\times c})])) \qquad (6)$$
$$F^{reg}_{w\times h\times c} = \sigma(f([\mathrm{AvgP}(F^{reg}_{w\times h\times c}); \mathrm{MaxP}(F^{reg}_{w\times h\times c})])) \qquad (7)$$

where $\mathrm{AvgP}$ and $\mathrm{MaxP}$ denote average pooling and max pooling along the channel axis, and $f$ is a convolution layer. After that, we use two convolution layers with kernel size 1×1 to reduce the number of channels from 256 to 2 and 4 for the two branches respectively, and concatenate the feature maps from the different backbone layers with trainable weights $\alpha_l$, i.e.,

$$P^{cls}_{w\times h\times 2} = \sum_{l=1}^{N} \alpha_l * F^{cls}_{l:w\times h\times 2} \qquad (8)$$
$$P^{reg}_{w\times h\times 4} = \sum_{l=1}^{N} \alpha_l * F^{reg}_{l:w\times h\times 4} \qquad (9)$$

where $N$ denotes the total number of backbone layers we use. The classification feature map has two channels: one represents the foreground, and the value $P^{cls}_{w\times h\times 2}(0,i,j)$ at point $(i,j)$ is the probability score of the target; the other represents the background, and $P^{cls}_{w\times h\times 2}(1,i,j)$ is the probability score of the background. The regression feature map has four channels, each representing the distance from a point location in the search branch input to one of the four sides of the bounding box; that is, each point $(i,j)$ in $P^{reg}_{w\times h\times 4}(:,i,j)$ is a vector denoted as $(l, r, t, b)$.

Classification label and regression label. For anchor based methods, positive and negative samples are classified by the value of the Intersection over Union between anchors and the ground truth. In this paper, we use an ellipse and a circle region to design labels for the points $(i,j)$ in the feature map, inspired by (Chen et al. (2020)).
The ellipse $E_1$ is centered at the ground-truth center $(g_{xc}, g_{yc})$ with axes $(\frac{g_w}{2}, \frac{g_h}{2})$. We also define the circle $C_2$, centered at the same point, with radius $r = \frac{1}{2}\big((\frac{g_w}{2})^2 + (\frac{g_h}{2})^2\big)^{\frac{1}{2}}$, i.e.,

$$\frac{(B(p_i)-g_{xc})^2}{(\frac{g_w}{2})^2} + \frac{(B(p_j)-g_{yc})^2}{(\frac{g_h}{2})^2} = 1 \qquad (10)$$
$$(B(p_i)-g_{xc})^2 + (B(p_j)-g_{yc})^2 = r^2 \qquad (11)$$

where $B$ denotes the mapping of a point $(i,j)$ in the feature map $P^{cls}_{w\times h\times 2}$ back to the search frame. If the point $B(p_i, p_j)$ falls within the $C_2$ region, it is given a positive label, and if it falls outside the $E_1$ area, it is given a negative label, i.e.,

$$label = \begin{cases} 1, & \text{if } C_2(p(i,j)) < r^2 \\ -1, & \text{if } E_1(p(i,j)) > 1 \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

For the regression branch, the regression targets can be defined by:

$$d^l_{(i,j)} = p_i - g_{x0}, \quad d^t_{(i,j)} = p_j - g_{y0}, \quad d^r_{(i,j)} = g_{x1} - p_i, \quad d^b_{(i,j)} = g_{y1} - p_j \qquad (13)$$

where $(g_{x0}, g_{y0})$ and $(g_{x1}, g_{y1})$ denote the top-left and bottom-right coordinates of the ground truth.

Loss function. We employ the cross-entropy loss to train the classification network. To predict more accurate bounding boxes, we adopt the DIoU loss (Zheng et al. (2020)) to train the regression network, i.e.,

$$L_{reg} = 1 - IoU + \frac{\rho^2(p, p^{gt})}{c^2} \qquad (14)$$

where $\rho(\cdot)$ is the Euclidean distance, $p$ and $p^{gt}$ denote the center points of the predicted box and the ground truth, and $c$ is the diagonal length of the smallest enclosing box covering the two boxes. For regression branch training, the DIoU loss optimizes the bounding box faster than the GIoU loss (Rezatofighi et al. (2019)). The overall loss function is:

$$L = \lambda_1 L_{cls} + \lambda_2 L_{reg} \qquad (15)$$

where the constants $\lambda_1$ and $\lambda_2$ weight the classification loss and the regression loss. During model training, we simply set $\lambda_1 = 1$, $\lambda_2 = 1$ without hyper-parameter search.
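The label assignment of Eqs. (10)-(12), the regression targets of Eq. (13) and the DIoU loss of Eq. (14) can be sketched as follows. This is a minimal plain-Python illustration under our own naming; in particular, the exact $C_2$ radius is our reading of the (garbled) original text.

```python
import math

def assign_label(px, py, gt):
    """Ellipse/circle label for a point (px, py) mapped back to the search
    frame. gt = (xc, yc, w, h) is the ground-truth box. Returns 1 / -1 / 0."""
    xc, yc, w, h = gt
    dx, dy = px - xc, py - yc
    r = 0.5 * math.sqrt((w / 2) ** 2 + (h / 2) ** 2)  # assumed C2 radius
    if dx ** 2 + dy ** 2 < r ** 2:                    # inside circle C2
        return 1
    if (dx ** 2) / ((w / 2) ** 2) + (dy ** 2) / ((h / 2) ** 2) > 1:
        return -1                                      # outside ellipse E1
    return 0                                           # ignored region

def regression_targets(px, py, gt):
    """Distances (l, t, r, b) from the point to the four box sides (Eq. 13)."""
    xc, yc, w, h = gt
    x0, y0, x1, y1 = xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
    return px - x0, py - y0, x1 - px, y1 - py

def diou_loss(box_a, box_b):
    """DIoU loss between two (x0, y0, x1, y1) boxes (Eq. 14)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    iou = inter / union if union > 0 else 0.0
    # squared distance between the two box centers
    rho2 = ((ax0 + ax1) / 2 - (bx0 + bx1) / 2) ** 2 \
         + ((ay0 + ay1) / 2 - (by0 + by1) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(ax1, bx1) - min(ax0, bx0)) ** 2 \
       + (max(ay1, by1) - min(ay0, by0)) ** 2
    return 1.0 - iou + (rho2 / c2 if c2 > 0 else 0.0)
```

The center-distance term is what distinguishes DIoU from GIoU: for disjoint boxes the loss still carries a gradient pulling the predicted center toward the ground-truth center.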

3.4. TRAINING AND INFERENCE

Training. We train our model using image pairs: a 127×127-pixel template patch and a 255×255-pixel search patch. The training datasets include ImageNet VID (Russakovsky et al. (2015)), COCO (Lin et al. (2014)), YouTube-BoundingBoxes (Real et al. (2017)), ImageNet DET (Real et al. (2017)) and GOT-10k (Huang et al. (2019)). Because negative samples greatly outnumber positive ones, we select at most 16 positive and 48 negative samples from each search image. Besides, in order to obtain more accurate regression information, we adopt the DIoU loss to optimize the regression branch. Inference. We feed the cropped first frame and the subsequent frames into the feature extraction network as the template image and search images. Next, the features are sent to the cross-attention module and passed through the classification and regression branches. After that, we obtain the classification map; the location with the highest score represents the most probable center of the tracking target. Then, we use the scale change penalty and the cosine window as introduced in (Li et al.

4.1. IMPLEMENTATION DETAILS

Our approach is implemented in Python with PyTorch on a single RTX 2080Ti GPU. The backbone is a modified ResNet-50 as in (He et al. (2016)), with weights pre-trained on ImageNet (Russakovsky et al. (2015)). During the training phase, the model is optimized by stochastic gradient descent (SGD) for a total of 20 epochs with a batch size of 28. For the first 10 epochs, we freeze the parameters of the backbone and train only the head structures; for the last 10 epochs, we unfreeze the last 3 blocks of the backbone and train them jointly. Besides, we warm up the training during the first 5 epochs with a learning rate increasing from 0.001 to 0.005, and over the last 15 epochs the learning rate decays exponentially from 0.005 to 0.00005.
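The learning-rate schedule above can be sketched as follows. The endpoints (0.001 → 0.005 warmup over 5 epochs, exponential decay 0.005 → 0.00005 over the remaining 15) come from the text; the linear shape of the warmup and the per-epoch interpolation are our assumptions, and the function name is ours.

```python
def lr_at_epoch(epoch, warmup_epochs=5, total_epochs=20,
                warmup_start=1e-3, base_lr=5e-3, end_lr=5e-5):
    """Learning rate for a given 0-indexed epoch: linear warmup followed by
    exponential decay between the stated endpoints."""
    if epoch < warmup_epochs:
        # linear warmup from warmup_start to base_lr over the first 5 epochs
        t = epoch / (warmup_epochs - 1) if warmup_epochs > 1 else 1.0
        return warmup_start + t * (base_lr - warmup_start)
    # exponential (geometric) decay from base_lr to end_lr over the rest
    decay_epochs = total_epochs - warmup_epochs
    t = (epoch - warmup_epochs) / (decay_epochs - 1)
    return base_lr * (end_lr / base_lr) ** t
```

A geometric interpolation like this keeps the decay "exponential" in the usual sense: the ratio between consecutive epochs' rates is constant.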

4.2. RESULTS ON THREE BENCHMARKS

To affirm the effect of our method, we compare our tracker with the recent trackers MAML (Wang et al. (2020a)), PrDiMP (Danelljan et al. (2020)), SiamBAN (Chen et al. (2020)), SiamCAR (Guo et al. (2020)), ROAM (Yang et al. (2020)), SiamFC++ (Xu et al. (2020)), STN (Tripathi et al. (2019)), ARTCS (Kristan et al. (2019)), SiamRPN++ (Li et al. (2019)), SATIN (Gao et al. (2020)), ATOM (Danelljan et al. (2019)), DiMP-18 (Bhat et al. (2019)), DaSiamRPN (Zhu et al. (2018a)), SPM (Wang et al. (2019)), ECO (Danelljan et al. (2017)), CFNet (Valmadre et al. (2017)), SiamRPN (Li et al. (2018)), DeepSRDCF (Danelljan et al. (2015a)) and SRDCF (Danelljan et al. (2015b)) on four benchmarks: UAV123 (Mueller et al. (2016)), VOT2018 (Kristan et al. (2018)), VOT2019 (Kristan et al. (2019)) and OTB100 (Wu et al. (2013)) (details in Appendix A.1).

4.2.1. RESULTS ON UAV123

UAV123 contains 123 challenging video sequences, which can be divided into 11 categories according to their attributes. Tracker performance is evaluated by two metrics: the precision score and the AUC score. The AUC score reflects the overlap between the predicted bounding box and the ground-truth box, while the precision score relates to the distance between the centers of the predicted bounding box and the ground-truth box. As shown in Figure 3, our method achieves the best precision score, 0.850, and the second-best AUC score, 0.678. Across the 11 challenge categories, SiamCAN ranks 1st or 2nd in Scale Variation, Similar Object, Fast Motion and Low Resolution (Appendix A.2). The results demonstrate that our tracker can handle the similar object and scale change challenges, owing to the cross-attention subnetwork and the anchor-free mechanism.
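The two metrics above can be computed as sketched below: the success score is the area under the curve of the fraction of frames whose overlap exceeds each threshold in [0, 1], and the precision score is the fraction of frames whose center error is within a pixel threshold (conventionally 20 px on OTB/UAV-style benchmarks). The function names are ours.

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def success_auc(overlaps, n_thresh=21):
    """AUC of the success plot: mean success rate over overlap thresholds
    0, 0.05, ..., 1.0 for a sequence of per-frame overlaps."""
    thresholds = [i / (n_thresh - 1) for i in range(n_thresh)]
    return sum(sum(o > t for o in overlaps) / len(overlaps)
               for t in thresholds) / n_thresh

def precision_at(center_errors, thresh=20.0):
    """Fraction of frames whose center error is within `thresh` pixels."""
    return sum(e <= thresh for e in center_errors) / len(center_errors)
```

In practice these are averaged over all sequences of the benchmark to produce the scores reported in Figure 3.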

4.2.2. RESULTS ON VOT2018

VOT2018 consists of 60 challenging videos. Trackers are ranked by EAO (Expected Average Overlap), which depends on both accuracy and robustness. As shown in Table 1, our tracker outperforms SiamRPN++ by 2.8 points by introducing the cross-attention module and anchor-free regression. The recently proposed MAML obtains the highest accuracy score, while our SiamCAN surpasses MAML by 2.5 points in terms of EAO. Besides, our robustness score ranks 2nd.

4.3. ANALYSIS OF THE PROPOSED METHOD

Discussion on effective sample selection. Anchor-free methods have the weakness that the network may produce low-quality predicted bounding boxes far away from the center of the target, even when the prediction is otherwise accurate. To address this issue, SiamFC++ and SiamCAR introduce a centerness branch to obtain high-quality samples, forcing the network to focus on the target center, while SiamBAN uses an elliptical region to design labels, which has the same effect. Accordingly, we run several experiments to find which method performs better. As shown in Table 3, the baseline tracker consists of the cross-attention module and the anchor-free network. The ellipse label does better than the circle label (2 vs 1), while the centerness branch combined with the ellipse label even has a worse effect (2 vs 3). As illustrated in Figure 4, at the training stage 2 gives positive labels to the points that fall within the E2 region, while 4 gives positive labels to the points that fall within the C2 region, a more central position. In this respect, the comparison in Table 3 (2 vs 4) can be explained.

Discussion on each component. As shown in Table 4, the baseline model, which consists of a regular backbone (ResNet-50), a classification network and an anchor-free regression network, obtains an EAO of 0.351. Replacing the GIoU loss with the DIoU loss during training yields higher scores, due to more accurate predicted bounding boxes (2 vs 1). Adding the cross-attention module obtains a large improvement of 4.3 points in EAO (4 vs 2), which demonstrates that the information interaction between the template branch and the search branch is of great significance. Finally, utilizing self-spatial attention gains another improvement of 2.4 points in EAO (4 vs 5).

Feature visualization. We visualize the features extracted by trackers 1, 3 and 5 in Figure 5. On the left side of Figure 5, tracker 1 vs tracker 3 shows that the tracker not equipped with the cross-attention module more easily loses the target and focuses on the wrong object when similar objects appear; the visualized features demonstrate that the cross-attention module enables the tracker to tell the target from similar distractors. On the right side of Figure 5, tracker 5 shows the power of the cross-attention module combined with a proper training strategy.



Figure 1: Tracking results on challenges of three kinds: scale variation (book), similar distractors (crabs) and fast motion (dinosaur). Compared with the other two state-of-the-art trackers, our tracker (SiamCAN) performs better.

Figure 2: Illustration of the proposed tracking framework, consisting of feature extraction, the cross-attention module and the anchor-free network. In the feature extraction part, the features of the third, fourth and fifth blocks are sent to the cross-attention module. The cross-attention module uses the channel attention from the template branch to reweight the feature from the search branch. The right side shows the classification and anchor-free regression network, which localizes the target and predicts the size of the bounding box.

(2018)) to guarantee the smooth movement of the target. According to the location $p$ of the final score, we obtain the predicted box $B$, i.e.,

$$b_{x1} = p_i - d^{reg}_l, \quad b_{y1} = p_j - d^{reg}_t \qquad (17)$$
$$b_{x2} = p_i + d^{reg}_r, \quad b_{y2} = p_j + d^{reg}_b \qquad (18)$$

where $d^{reg}_{l,r,t,b}$ denote the predicted values of the regression targets on the regression map, and $(b_{x1}, b_{y1})$ and $(b_{x2}, b_{y2})$ are the top-left and bottom-right corners of the predicted box.
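The box decoding of Eqs. (17)-(18) can be sketched as follows (a plain-Python illustration; the function name is ours):

```python
def decode_box(pi, pj, d):
    """Decode a predicted box from a point location and its regression output.

    pi, pj: point location in the search frame.
    d: (l, t, r, b) predicted distances to the four box sides.
    Returns (x0, y0, x1, y1), the top-left and bottom-right corners.
    """
    dl, dt, dr, db = d
    return (pi - dl, pj - dt, pi + dr, pj + db)
```

Decoding is the exact inverse of the regression targets of Eq. (13): applying it to the targets computed from a ground-truth box at the same point recovers that box.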


Figure 3: Success and precision plots on UAV123.

Figure 4: Comparison of different label assignments. Both 2 and 4 predict an accurate bounding box, but the latter focuses more on the target center.

Figure 7: Comparisons on OTB-100 with challenges: Deformation, Background Clutters, Scale Variation and Out-of-Plane Rotation. Our SiamCAN ranks in the top two.

Figure 8: Comparisons on UAV123 with challenges: Scale Variation, Similar Object, Fast Motion and Low Resolution. Our SiamCAN ranks in the top two.

Table 1: Comparison with state-of-the-art trackers on VOT2018. Red, blue and green fonts indicate the top-3 trackers. Table 2 reflects the evaluation results on VOT2019 compared with recent trackers.

Comparison with state-of-the-art trackers on VOT2019. Red, blue and green fonts indicate the top-3 trackers.

Results of different sample selection methods on VOT2018.

Results of each component of our tracker on VOT2018.

5. CONCLUSION

In this paper, we propose a simple Siamese-based tracker called SiamCAN, which combines cross-attention and an anchor-free mechanism. The cross-attention module utilizes the target template channel attention to guide the feature learning of the search frame, bridging the information flow between the two branches. The anchor-free regression discards the fussy design of anchors and adjusts the scale ratio of the bounding box automatically. To use these components to their fullest potential, we choose an appropriate label assignment strategy and a suitable loss function, boosting tracking performance with limited laboratory equipment. Extensive experiments conducted on four benchmarks demonstrate that our tracker, despite its light structure, achieves state-of-the-art performance, especially on the scale variation, background clutter, deformation and similar distractor challenges.

