LEARNING TO COUNT EVERYTHING: TRANSFORMER-BASED TRACKERS ARE STRONG BASELINES FOR CLASS AGNOSTIC COUNTING

Abstract

Class-agnostic counting (CAC) is a vision task that counts the total number of occurrences of any given reference object in a query image. The task is usually formulated as a density map estimation problem via similarity computation between a few image samples of the reference object and the query image. In this paper, we show that the popular and effective similarity computation operation, bilinear similarity, actually shares a high resemblance with the self-attention and cross-attention operations widely used in the transformer architecture. Inspired by this observation, and since the formulation of the visual object tracking task is similar to that of CAC, we show that the advanced attention modules of transformer-based trackers are powerful matching tools for the CAC task. These modules allow the model to learn more distinctive features that capture the patterns shared between the query and reference images. In addition, we propose a transformer-based class-agnostic counting framework by adapting transformer-based trackers for CAC. We demonstrate the effectiveness of the proposed framework with two state-of-the-art transformer-based trackers, MixFormer and TransT, through extensive experiments and ablation studies. The proposed methods outperform other state-of-the-art methods on the challenging FSC-147 and CARPK datasets and achieve new state-of-the-art performance. The code will be publicly available upon acceptance.

1. INTRODUCTION

Object counting is a popular research topic in the vision community with a wide range of applications, including visual surveillance, intelligent agriculture, etc. It aims to count the number of occurrences of target objects in an image. Object counting methods can be classified into two major categories: class-specific object counting and class-agnostic counting. Class-specific object counting usually focuses on a specific category such as cars, animals, or people; in particular, crowd counting, which counts the number of people in a crowd, is well studied by Song et al. (2021); Cheng et al. (2022; 2019); Li et al. (2018). However, it requires training an individual model for each category, with tremendous effort spent collecting thousands of annotated training images, and it fails to work for unseen classes. In contrast, class-agnostic counting (CAC), studied by Lu et al. (2018); Ranjan et al. (2021); Yang et al. (2021), has arisen recently and aims to count any novel objects within the query image, especially objects of classes unseen during the training stage. Given several reference images of the target class, a CAC model predicts the number of occurrences within the query image. Current models share a similar network architecture, consisting of a feature extractor, a matching module, and a density head. Once the query and reference feature maps are extracted by the feature extractor, they are fed into the matching module to compute a similarity map, followed by a density head that yields a density map estimate. The sum of the values in the density map is used as the final estimated object count. Nevertheless, a major drawback of traditional CAC models is their fixed matching framework, which performs template matching in the same fashion regardless of variations in the patterns of the reference objects. Moreover, the extracted features for matching are not discriminative enough across categories. Last but not least, traditional models are supervised by the pixel-wise root-mean-squared error (RMSE) between the ground-truth and predicted density maps.
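The generic pipeline described above (feature extraction, similarity matching, density estimation, and summation for the final count) can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (an identity bilinear weight and a rectified similarity map standing in for a learned density head), not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature maps": query is H x W x C; the reference object is pooled
# into a single C-dimensional feature vector.
H, W, C = 8, 8, 16
query_feat = rng.standard_normal((H, W, C))
ref_feat = rng.standard_normal(C)      # pooled reference-object feature

# Matching module: bilinear similarity s(x) = q(x)^T W r; here W is the
# identity for simplicity, i.e. plain dot-product matching per location.
similarity = query_feat @ ref_feat     # (H, W) similarity map

# A real density head is a learned decoder; as a stand-in we rectify and
# normalize the similarity map into a non-negative "density" map, scaled
# as if the head predicted 3 objects.
density = np.maximum(similarity, 0.0)
density /= density.sum() + 1e-8
density *= 3.0

# The final count is the sum of the density map values.
count = density.sum()
print(round(count, 2))                 # -> 3.0
```

The key property this sketch preserves is that counting reduces to integrating (summing) a predicted density map, so the matcher and density head only need to produce per-location evidence rather than explicit detections.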

2. RELATED WORKS

Given the large amount of related work in the literature, we briefly review the most relevant studies on class-specific and class-agnostic object counting below.

2.1. CLASS-SPECIFIC OBJECT COUNTING

Class-specific counting focuses on counting the number of occurrences of objects of a specific class, such as crowd counting. The task is usually approached in two paradigms:




Figure 1: A general pipeline of CAC models consists of a feature extractor, matcher, and density head. Given a query image and at least one reference image, the model learns to predict a density map for counting.

Such pixel-wise supervision fails to recognize predictions with respect to each ground-truth object as a unit. The state-of-the-art model BMNet proposed by Shi et al. (2022) alleviates these weaknesses by introducing (1) bilinear similarity (or dot-product attention) in matching, (2) self-attention applied to the feature maps after feature extraction, and (3) an additional contrastive loss function. Yet, these revisions are inadequate for more accurate object counting. In this work, we overcome the above problems of matching and feature extraction by integrating transformer-based object tracking into the CAC feature extractor. We first show the connection between the popular bilinear similarity matching module and the prevailing attention modules used by transformer-based methods proposed by Vaswani et al. (2017). In addition, among different transformer-based methods, visual object tracking shares a formulation similar to the CAC task, namely localizing the target object in the query images (or upcoming video frames); we therefore demonstrate that the advanced attention modules of transformer-based trackers are powerful matching tools for the CAC task. These attention modules allow the model to learn more distinctive features that capture the patterns shared between the query and reference images. We verify our idea by adapting two state-of-the-art transformer-based trackers, MixFormer by Cui et al. (2022) and TransT by Chen et al. (2021), to the CAC task, replacing the original tracking prediction head with a U-Net-like density head for density map estimation.
With extensive experiments and ablation studies, the results not only demonstrate the effectiveness of the proposed methods but also show that they outperform other state-of-the-art methods on the challenging FSC-147 and CARPK datasets, achieving new state-of-the-art performance. This further shows that CAC models based on transformer-based trackers are strong baselines for the CAC task.
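The claimed resemblance between bilinear similarity and attention can be made concrete with a small numerical sketch. This is a hypothetical NumPy illustration (the projection matrices and shapes are invented for exposition): the bilinear similarity S = Q W R^T used by BMNet-style matchers equals, up to the 1/sqrt(C) scaling and the subsequent softmax, the cross-attention score matrix of Vaswani et al. (2017) with W = W_q W_k^T.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, C = 6, 4, 8                  # N query locations, M reference locations
Q = rng.standard_normal((N, C))    # flattened query features
R = rng.standard_normal((M, C))    # flattened reference features

# Bilinear similarity with a learnable weight W; identity for illustration,
# which reduces it to plain dot-product matching.
Wb = np.eye(C)
S_bilinear = Q @ Wb @ R.T                       # (N, M)

# Cross-attention scores: project with learned W_q, W_k (identity here),
# take scaled dot products, then apply a row-wise softmax.
W_q, W_k = np.eye(C), np.eye(C)
scores = (Q @ W_q) @ (R @ W_k).T / np.sqrt(C)   # (N, M)

# Up to the 1/sqrt(C) scaling, the attention score matrix IS a bilinear
# similarity with W = W_q W_k^T.
assert np.allclose(S_bilinear, scores * np.sqrt(C))

attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
print(attn.shape)                  # -> (6, 4)
```

The softmax normalization and the learned value projection are what the attention modules of trackers add on top of raw bilinear matching, which is one way to read the advantage claimed in this work.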

detection-based and regression-based methods. Detection-based methods, including those proposed by Leibe et al. (2005); Hsieh et al. (2017), perform explicit object detection over the input image using visual object detectors and then derive the count. However, like general object detection methods, their performance degrades when objects are overlapping, occluded, or crowded. To address these problems, the regression-based methods proposed by Thanasutives et al. (2021); Ma et al. (2021); Cheng et al. (2022) instead predict a density map of the input image, where each pixel value can be interpreted as the fraction, or confidence, of a target object present at that location. The sum of these values is then used as the estimated object count. In addition, the ground-truth density maps for training are generated by convolving point annotations
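The ground-truth construction mentioned here can be illustrated with a minimal sketch (NumPy only; the kernel width sigma is an illustrative choice, not a value from any cited method): each point annotation is replaced by a unit-mass Gaussian, so the resulting density map sums to the object count, up to truncation at the image border.

```python
import numpy as np

def gaussian_density_map(points, h, w, sigma=1.5):
    """Place a unit-mass Gaussian at each annotated point.

    Each Gaussian is normalized to sum to 1, so the whole map sums to the
    number of annotated objects (modulo border truncation).
    """
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros((h, w))
    for (py, px) in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        g /= g.sum()          # normalize: each object contributes mass 1
        density += g
    return density

# Three annotated object centers in a 32x32 image.
points = [(8, 8), (16, 20), (25, 12)]
dmap = gaussian_density_map(points, 32, 32)
print(round(dmap.sum()))      # -> 3
```

This is why summing a well-trained density map recovers the count: the regression target itself integrates to the number of annotated objects.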

