OBJECT TRACKING BY HIERARCHICAL PART-WHOLE ATTENTION

Abstract

We show that hierarchical representations of objects provide an informative, low-noise proxy for associating objects of interest in multi-object tracking. This matches the intuition that we usually need to compare only a small region of a target's body to distinguish it from other objects. We build the hierarchical representation at three levels: (1) target body parts, (2) the whole target body, and (3) the union area of the target and the objects that overlap it. Furthermore, with the spatio-temporal attention mechanism of the transformer, we can solve tracking in a global fashion while keeping the process online. We design our method by combining this representation with the transformer and name it Hierarchical Part-Whole Attention, or HiPWA for short. Experiments on multiple datasets suggest that it is effective. Moreover, previous methods mostly leverage transformers to exploit long temporal context during association, which requires heavy computation resources. HiPWA instead focuses on a more informative representation of objects in every single frame. It is therefore more robust to the length of the temporal context and more computationally economical.

1. INTRODUCTION

How to represent the visual existence of an object discriminatively is a core question in computer vision. In this paper, we propose a hierarchical part-whole representation for this purpose. We adopt multi-object tracking as the application area because distinguishable appearance features are critical for avoiding mismatches among targets when tracking across frames. To gather and process visual information from different levels, we combine the hierarchical part-whole representation with the attention mechanism of transformers to summarize discriminative visual representations for objects of interest.

In multi-object tracking, given a bounding box that localizes an object of interest, how should we recognize the major object within the box and distinguish it from the background and from other objects, especially those that also partially appear in the box? We believe the visual specificity of an object comes from three perspectives: the compositional, the semantic, and the contextual. The compositional perspective concerns the salient and unique visual regions on an object, such as a hat on a pedestrian whose color differs from all others in the same image. With a salient visual component attached to an object, we can track it across frames even without seeing its full body. The semantic perspective is the one commonly adopted in modern computer vision, for example a tight bounding box or an instance segmentation mask. It defines the area the object occupies by binding its visual existence to a semantic concept. Finally, the contextual perspective describes the surroundings of an object and helps to distinguish the object by contrast. For example, the bounding box might contain pixels from the background and from secondary objects. However, a tight bounding box offers a strong prior when combined with visual context: an object whose parts extend across the boundary of the bounding box should not be the major object of that box. Whether it is a secondary object or not an object of interest at all, it should be regarded as noise when we generate a distinguishable visual representation for the major subject in the bounding box.

The analysis above shows that each level has value for representing an object discriminatively. Motivated by this insight, we propose to represent an object by a three-level hierarchy: body parts, the full body, and the union area that includes overlapping objects. We summarize it as a "Part-Body-Union" hierarchy. With the hierarchy constructed, an ideal path to solving target association in multi-object tracking is to leverage the salient information within the body area and to discard mismatches by eliminating the noise revealed by contextual contrast. To avoid requiring more fine-grained data annotation, we use transformers to process the hierarchical representation, since the attention mechanism can discover important visual information. By combining the hierarchical visual representation with attention-based feature fusion, we arrive at our method, Hierarchical Part-Whole Attention, or HiPWA for short. In this work, we build a baseline model following this design and demonstrate its effectiveness on multi-object tracking problems. In experiments on multiple multi-object tracking datasets, the proposed method achieves performance comparable to or better than state-of-the-art transformer-based methods, with a more lightweight implementation and better time efficiency during training and inference.
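To make the three levels of the Part-Body-Union hierarchy concrete, the sketch below derives them from bounding boxes alone, so no annotation beyond boxes is needed. It is only an illustration under stated assumptions: the text above does not specify how body parts are obtained, so the even 2x2 grid used here is an assumption, and all function names (`overlaps`, `split_into_parts`, `union_box`, `part_body_union`) are hypothetical rather than part of any released implementation.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def split_into_parts(body: Box, rows: int = 2, cols: int = 2) -> List[Box]:
    """Part level. ASSUMPTION: parts are an even rows x cols grid over the body box."""
    x1, y1, x2, y2 = body
    w = (x2 - x1) / cols
    h = (y2 - y1) / rows
    return [
        (x1 + c * w, y1 + r * h, x1 + (c + 1) * w, y1 + (r + 1) * h)
        for r in range(rows)
        for c in range(cols)
    ]


def union_box(body: Box, others: List[Box]) -> Box:
    """Union level: smallest box enclosing the target and every box that overlaps it."""
    group = [body] + [o for o in others if overlaps(body, o)]
    return (
        min(b[0] for b in group),
        min(b[1] for b in group),
        max(b[2] for b in group),
        max(b[3] for b in group),
    )


def part_body_union(body: Box, others: List[Box]) -> Dict[str, object]:
    """Collect the three levels of the hierarchy for one target box."""
    return {
        "parts": split_into_parts(body),
        "body": body,
        "union": union_box(body, others),
    }


if __name__ == "__main__":
    target = (10, 10, 30, 50)
    neighbours = [(25, 20, 45, 60), (100, 100, 120, 140)]  # only the first overlaps
    for level, region in part_body_union(target, neighbours).items():
        print(level, region)
```

In this sketch, a neighbour that does not overlap the target leaves the union level unchanged, so an isolated target has a union area equal to its body box. Features cropped from the three kinds of regions would then be fused by attention; that fusion step is not shown here.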

2.1. REPRESENTING OBJECTS BY PARTS

The most commonly used object representation for multi-object tracking is the bounding box. However, a bounding box is noisy because it contains background pixels and pixels from secondary objects. Moreover, everyday experience suggests that, in many scenarios, it is not necessary to observe the full body of an object to identify it visually, and tracking a target by its distinguishable parts is often more efficient. Researchers have therefore studied object detection and tracking with more fine-grained representations. A common approach is to use certain pre-defined parts of the target body, such as only the human head (Sundararaman et al., 2021; Shao et al., 2018), human joints (Andriluka et al., 2018; Xiu et al., 2018), or even every pixel (Voigtlaender et al., 2019; Weber et al., 2021). However, all these choices require more fine-grained data annotation beyond bounding boxes and more fine-grained perception modules beyond commonly available object detectors. In contrast, the part-whole hierarchy we construct requires no additional annotation, and we still solve tracking at the granularity of bounding boxes.

The idea of modeling objects at different levels is inspired by David Marr's hierarchical modeling of the human body (Marr, 2010), in which he explains how to construct the visual structure of an object from the primal sketch to the 2.5D sketch and further to a 3D representation. His classic three levels of a visual information processing system summarize this at a higher level: the computational, the algorithmic, and the implementational. A similar theory is introduced by Fodor & Pylyshyn (1988) as the semantic, the syntactic, and the physical. Compared to these cognitive theories, which aim to model general visual representation, the three perspectives we propose for recognizing an object and distinguishing it from others (the compositional, the semantic, and the contextual) apply only to the specific problem of generating an effective visual descriptor for the objects of interest.

2.2. TRANSFORMER-BASED MULTI-OBJECT TRACKING

The Transformer (Vaswani et al., 2017) was originally proposed for natural language processing, where it showed a powerful capacity for representing and processing information. Later, DETR (Carion et al., 2020) introduced the transformer to visual perception for object detection, modeling detection as a bipartite matching problem. Because the matching-based strategy of DETR is quite similar to target matching in multi-object tracking (MOT), it is natural to carry the transformer over to this area. TransTrack (Sun et al., 2020) is the first work to use the transformer for the MOT problem, but it does not introduce a transformer-based association strategy. A concurrent work, TrackFormer (Meinhardt et al., 2021), goes a step further and uses the cross-attention in the transformer decoder for association through query passing. VisTR (Wang et al., 2021c), in contrast, proposes a global association scheme built on the transformer, in which a video clip of multiple frames is fed into the transformer at once to associate objects within the clip. More recently, many works (Zhou et al., 2022; Zeng et al., 2021) follow the global association scheme in either training or inference and achieve good performance. A key to their success is processing information over a long temporal period, which is hard to handle without the transformer. GTR (Zhou et al., 2022) builds a baseline model that uses only appearance to associate objects and removes some secondary modules such as positional encoding and learnable object queries. However, a downside of processing multiple frames as a batch with the transformer is the high demand for computation resources. It has become common practice to train models on at least 4xV100 GPUs (Zhou et al., 2022; Sun et al., 2020; Zeng et al., 2021) or even 8xA100 GPUs (Cai et al., 2022).
These methods usually suffer a significant performance drop if only limited computation resources are available. This is because they usually improve association performance by taking advantage of a long temporal window and gathering more visual context within it. In

