OBJECT TRACKING BY HIERARCHICAL PART-WHOLE ATTENTION

Abstract

We show in this paper that hierarchical representations of objects provide an informative, low-noise proxy for associating objects of interest in multi-object tracking. This aligns with the intuition that we usually need to compare only a small region of a target's body to distinguish it from other objects. We build the hierarchical representation at three levels: (1) target body parts, (2) the whole target body, and (3) the union area of the target and the objects that overlap it. Furthermore, with the spatio-temporal attention mechanism of the transformer, we can solve tracking in a global fashion while keeping the process online. We combine this representation with the transformer and name the resulting method Hierarchical Part-Whole Attention, or HiPWA for short. Experiments on multiple datasets suggest that it is effective. Moreover, previous methods mostly leverage transformers to exploit long temporal context during association, which requires heavy computational resources. HiPWA instead focuses on a more informative representation of objects in every single frame, so it is more robust to the length of the temporal context and more computationally economical.

1. INTRODUCTION

How to represent the visual existence of an object in a discriminative fashion is a core question in computer vision. In this paper, we propose a hierarchical part-whole representation for this purpose. We adopt multi-object tracking as the application area because distinguishable appearance features are critical for avoiding mismatches among target objects when tracking across frames. To gather and process visual information from different levels, we combine the hierarchical part-whole representation with the attention mechanism of transformers to summarize discriminative visual representations for objects of interest.

In multi-object tracking, given a bounding box that localizes an object of interest, how should we recognize the major object within the box and distinguish it from the background and from other objects, especially those that are also partially present in the box? We believe the visual specificity of an object comes from three perspectives: the compositional, the semantic, and the contextual. Compositional information refers to the salient and unique visual regions on an object, such as a hat on a pedestrian whose color differs from all others in the same image. With a salient visual component attached to an object, we can track it across frames even without seeing its full body. Semantic visual information is the kind commonly adopted in modern computer vision, such as a tight bounding box or an instance segmentation mask. It defines the area the object occupies and binds its visual existence to its semantic concept. Finally, contextual visual information describes the surroundings of an object and helps to distinguish it via contrast. For example, the bounding box might contain pixels from the background and secondary objects.
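As a rough illustration of these levels in terms of box geometry, the sketch below derives part boxes and a union box from a target's body box. It is not the implementation used in this paper: the `(x1, y1, x2, y2)` box convention, the function names, and the uniform grid used to stand in for body parts are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed convention: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if boxes a and b share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def part_boxes(body: Box, rows: int = 2, cols: int = 2) -> List[Box]:
    """Hypothetical part level: split the body box into a rows x cols grid."""
    x1, y1, x2, y2 = body
    w = (x2 - x1) / cols
    h = (y2 - y1) / rows
    return [
        (x1 + c * w, y1 + r * h, x1 + (c + 1) * w, y1 + (r + 1) * h)
        for r in range(rows)
        for c in range(cols)
    ]


def union_box(body: Box, others: List[Box]) -> Box:
    """Contextual level: smallest box enclosing the body and every box overlapping it."""
    group = [body] + [o for o in others if overlaps(body, o)]
    return (
        min(b[0] for b in group),
        min(b[1] for b in group),
        max(b[2] for b in group),
        max(b[3] for b in group),
    )


# Illustrative values only: one target, one overlapping box, one distant box.
target = (10, 10, 50, 90)
neighbours = [(40, 20, 80, 100), (200, 200, 220, 240)]
parts = part_boxes(target)
union = union_box(target, neighbours)
```

In this sketch the distant box does not touch the target, so it leaves the union region unchanged; only overlapping boxes enlarge the context.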
However, a tight bounding box offers a strong underlying prior when combined with visual context: an object whose parts extend across the boundary of the bounding box should not be the major object of that box. Whether it is a secondary object or not an object of interest at all, it should be regarded as noise when we generate a distinguishable visual representation for the major subject in the bounding box.

The analysis above shows that each level has its own value in representing an object discriminatively. Motivated by this insight, we propose to represent an object by a three-level hierarchy: body parts, the full body, and the union area that includes overlapping objects. We summarize it as a "Part-Body-Union" hierarchy. With the hierarchy constructed, an ideal path to solving target association in multi-object tracking is to leverage the salient information within the body area and discard mismatches by eliminating the

