SEMI-SUPERVISED COUNTING VIA PIXEL-BY-PIXEL DENSITY DISTRIBUTION MODELLING

Abstract

This paper focuses on semi-supervised crowd counting, where only a small portion of the training data is labeled. We formulate the pixel-wise density value to regress as a probability distribution instead of a single deterministic value. On this basis, we propose a semi-supervised crowd counting model. First, we design a pixel-wise distribution matching loss to measure the difference between the predicted and ground-truth pixel-wise density distributions. Second, we enhance the transformer decoder with density tokens that specialize the forward propagation of the decoder w.r.t. different density intervals. Third, we design an interleaving consistency self-supervised learning mechanism to learn efficiently from unlabeled data. Extensive experiments on four datasets show that our method outperforms its competitors by a large margin under various labeled-ratio settings. Code will be released.

1. INTRODUCTION

Crowd counting (Zhang et al., 2016; Cao et al., 2018; Ma et al., 2019) is becoming increasingly important in computer vision, with wide applications such as congestion estimation and crowd management. Many fully-supervised crowd counting models have been proposed, but they require a large amount of labeled data to train an accurate and stable model. Because crowds are dense, annotating the center of each person's head across a dataset of dense crowd images is laborious and time-consuming. To alleviate the need for large amounts of labeled data, this paper focuses on semi-supervised counting, where only a small portion of the training data is labeled (Liu et al., 2018b).

Traditional semi-supervised counting methods target density regression and then leverage self-supervised criteria (Liu et al., 2018b; 2019b) or pseudo-label generation (Sindagi et al., 2020b; Meng et al., 2021) to exploit supervision signals from unlabeled data. These methods are designed to generate density maps directly, with each pixel associated with a definite value. However, it is still extremely difficult to learn a good model because the pixel labels are uncertain. First, the annotations commonly contain erroneous head locations (Wan & Chan, 2020; Bai et al., 2020). Second, the pseudo labels that the models assign to unlabeled training data are pervasively noisy.

To address these challenges, we propose a new semi-supervised counting model, termed the Pixel-by-Pixel Probability distribution modelling Network (P³Net). Unlike traditional methods, which generate a deterministic pixel density value, we model the targeted density value of a pixel as a probability distribution. On this premise, we contribute to semi-supervised counting in four ways.

• We propose a Pixel-wise probabilistic Distribution Matching (PDM) loss to match the distributions of the predicted density values and the targeted ones pixel by pixel. The PDM loss, designed in line with the discrete form of the 1D Wasserstein distance, measures the cumulative gap between the predicted distribution and the ground-truth one along the density (interval) dimension. By modeling the density intervals probabilistically, our method responds well to the uncertainty in the labels. It thus surpasses traditional methods that regard the density values as deterministic.

• We incorporate the transformer decoder structure with a density-token scheme to modulate the features and generate high-quality density maps. A density token encodes the semantic information of a specific density interval. In prediction, these density-specific tokens specialize the forward propagations of the decoder with respect to the corresponding density intervals.
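
The cumulative-gap idea behind the PDM loss in the first contribution can be illustrated with a minimal sketch. It assumes that each pixel's prediction and target are histograms over the same ordered, unit-spaced density intervals, and it computes the standard discrete 1D Wasserstein-1 distance as the sum of absolute differences between the two cumulative distributions. The function names `cumulative_gap` and `pixelwise_cumulative_gap` are hypothetical, and the sketch is not claimed to be the exact loss used by P³Net.

```python
from itertools import accumulate


def cumulative_gap(pred, target):
    """Discrete 1D Wasserstein-1 distance between two histograms.

    Assumption: both histograms are defined over the same ordered,
    unit-spaced density intervals and each sums to the same total mass.
    """
    if len(pred) != len(target):
        raise ValueError("distributions must cover the same intervals")
    # Running sums give the cumulative distribution along the interval axis.
    cdf_pred = accumulate(pred)
    cdf_target = accumulate(target)
    # The distance is the accumulated absolute gap between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_pred, cdf_target))


def pixelwise_cumulative_gap(pred_map, target_map):
    """Mean cumulative gap over pixels.

    Each argument is a sequence with one per-pixel distribution per entry
    (a hypothetical layout chosen only for this illustration).
    """
    gaps = [cumulative_gap(p, t) for p, t in zip(pred_map, target_map)]
    return sum(gaps) / len(gaps)
```

In this sketch, moving all probability mass from the first of three intervals to the last costs twice as much as moving it to the middle one, so predictions that land in a nearby density interval are penalized less than predictions that land far away. A plain per-interval classification loss would not make that distinction.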

