LEARNING TO COUNT EVERYTHING: TRANSFORMER-BASED TRACKERS ARE STRONG BASELINES FOR CLASS AGNOSTIC COUNTING

Abstract

Class-agnostic counting (CAC) is a vision task that counts the total number of occurrences of a given reference object in a query image. The task is usually formulated as a density map estimation problem, computing similarities between a few image samples of the reference object and the query image. In this paper, we show that the popular and effective similarity computation operation, bilinear similarity, actually bears a strong resemblance to the self-attention and cross-attention operations widely used in the transformer architecture. Inspired by this observation, and since the formulation of the visual object tracking task is similar to that of CAC, we show that the advanced attention modules of transformer-based trackers are powerful matching tools for the CAC task. These modules learn more distinctive features that capture the patterns shared between the query and reference images. In addition, we propose a transformer-based class-agnostic counting framework by adapting transformer-based trackers for CAC. We demonstrate the effectiveness of the proposed framework with two state-of-the-art transformer-based trackers, MixFormer and TransT, through extensive experiments and ablation studies. The proposed methods outperform other state-of-the-art methods on the challenging FSC-147 and CARPK datasets and achieve new state-of-the-art performance. The code will be publicly available upon acceptance.

1. INTRODUCTION

Object counting is a popular research topic in the vision community with a wide range of applications, including visual surveillance, intelligent agriculture, etc. It aims to count the number of occurrences of target objects in an image. Object counting methods can be classified into two major categories: class-specific counting and class-agnostic counting. Class-specific counting usually focuses on a single category such as cars, animals, or people; in particular, crowd counting, which counts the number of people in a crowd, is well studied by Song et al. (2021); Cheng et al. (2022; 2019); Li et al. (2018). However, it requires training an individual model for each category, with a tremendous effort to collect thousands of annotated training images, and fails to work on unseen classes. In contrast, class-agnostic counting (CAC), studied by Lu et al. (2018); Ranjan et al. (2021); Yang et al. (2021), has arisen recently and aims to count any novel objects within the query image, especially objects of classes unseen during the training stage. Given several reference images of objects of the target class, a CAC model predicts the number of occurrences within the query image. Current models share a similar network architecture, consisting of a feature extractor, a matching module, and a density head. Once the query and reference feature maps are extracted by the feature extractor, they are fed into the matching module to compute a similarity map, followed by a density head that yields a density map estimate. The sum of the values in the density map is used as the final estimated object count. Nevertheless, the major drawbacks of traditional CAC models are their fixed matching frameworks, which perform template matching in the same fashion regardless of variations in the patterns of the reference objects. Moreover, the extracted features used for matching are not discriminative enough across categories. Last but not least, traditional models are supervised by the pixel-wise root-mean-square error (RMSE) between the ground-truth and predicted density maps,
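As a minimal illustration of the matching pipeline described above, and of the structural resemblance between bilinear similarity and cross-attention scores noted in the abstract, the following NumPy sketch compares the two operations. All names (`Fq`, `fr`, `M`, `Wq`, `Wk`) are hypothetical placeholders, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 64, 8, 8                       # channel dim and query feature-map size

Fq = rng.standard_normal((H * W, C))     # query features, one row per location
fr = rng.standard_normal((C,))           # pooled reference (exemplar) feature

# Bilinear similarity: s_i = Fq_i^T M fr, with a learned matrix M.
M = rng.standard_normal((C, C))
sim_bilinear = Fq @ M @ fr               # shape (H*W,)

# Cross-attention scores: project Fq to queries and fr to a key, then take
# scaled dot products. Algebraically this is Fq @ (Wq @ Wk) @ fr / sqrt(C),
# i.e. the same Fq @ A @ fr form as bilinear similarity, with A = Wq @ Wk.
Wq = rng.standard_normal((C, C))
Wk = rng.standard_normal((C, C))
scores = (Fq @ Wq) @ (Wk @ fr) / np.sqrt(C)   # shape (H*W,)

# Toy stand-in for the density head: the count is the sum of the density map.
density = np.maximum(sim_bilinear, 0.0).reshape(H, W)
count = density.sum()
```

Both operations reduce to a query-reference product through a learned matrix, which is why attention modules can serve as drop-in matching tools for CAC.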

