WEAKLY-SUPERVISED HOI DETECTION VIA PRIOR-GUIDED BI-LEVEL REPRESENTATION LEARNING

Abstract

Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only. This is inherently challenging due to ambiguous human-object associations, the large search space of detecting HOIs, and the highly noisy training signal. A promising strategy to address those challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy (Liao et al., 2022) does not perform well in the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI representation capable of incorporating prior knowledge at both the image level and the HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations. Experimental results on HICO-DET and V-COCO show that our method outperforms previous works by a sizable margin, demonstrating the efficacy of our HOI representation.

1. INTRODUCTION

Human-object interaction (HOI) detection aims to simultaneously localize the human-object regions in an image and classify their interactions, and serves as a fundamental building block for a wide range of tasks in human-centric artificial intelligence, such as human activity recognition (Heilbron et al., 2015; Tina et al., 2021), human motion tracking (Wafae et al., 2019; Nishimura et al., 2021) and anomalous behavior detection (Liu et al., 2018; Pang et al., 2020). HOI detection usually adopts a supervised learning paradigm (Gupta & Malik, 2015; Chao et al., 2018; Wan et al., 2019; Gao et al., 2020; Zhang et al., 2021c), which requires detailed annotations (i.e., human and object bounding boxes and their interaction types) in the training stage. However, such HOI annotations are expensive to collect and prone to labeling errors. In contrast, it is much easier to acquire image-level descriptions of target scenes. Consequently, a more scalable strategy for HOI detection is to learn from weak annotations at the image level, known as weakly-supervised HOI detection (Zhang et al., 2017). Learning under such weak supervision is particularly challenging, mainly due to the lack of accurate visual-semantic associations, the large search space of detecting HOIs, and the highly noisy training signal from image-level supervision alone. Most existing works (Zhang et al., 2017; Baldassarre et al., 2020; Kumaraswamy et al., 2021) tackle weakly-supervised HOI detection in a Multiple Instance Learning (MIL) framework (Ilse et al., 2018): they first utilize an object detector to generate human-object proposals and then train an interaction classifier with image-level labels as supervision. Despite promising results, these methods suffer from several weaknesses when coping with diverse and fine-grained HOIs.
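The MIL pipeline just described can be summarized in a few lines: all human-object pair proposals in an image form a "bag", each pair is scored by an interaction classifier, and the per-pair scores are aggregated (here by max-pooling, one common MIL choice) into image-level predictions that can be supervised with image-level labels. The following is a minimal NumPy sketch; the linear classifier and max-pooling aggregator are illustrative assumptions, not any specific prior method.

```python
import numpy as np

def mil_image_scores(pair_features, classifier_weights):
    """Score every human-object pair, then aggregate to image level.

    pair_features: (P, D) features for P candidate human-object pairs
    classifier_weights: (D, C) linear interaction classifier for C classes
    Returns per-pair scores (P, C) and image-level scores (C,).
    """
    logits = pair_features @ classifier_weights      # (P, C)
    pair_scores = 1.0 / (1.0 + np.exp(-logits))      # sigmoid: multi-label scores
    image_scores = pair_scores.max(axis=0)           # MIL max-pooling over the bag
    return pair_scores, image_scores

# Toy example: 4 candidate pairs, 8-dim features, 3 interaction classes.
rng = np.random.default_rng(0)
pair_s, img_s = mil_image_scores(rng.normal(size=(4, 8)), rng.normal(size=(8, 3)))
```

Only `img_s` receives supervision; which pair is responsible for a positive class is left latent, which is precisely why erroneous human-object associations can survive training.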
Firstly, they usually rely on visual representations derived from an external object detector, which mainly capture the semantic concepts of the objects in the scene and hence are insufficient for representing fine-grained interactions. Secondly, as the image-level supervision tends to ignore the imbalance among HOI classes, their representation learning is more susceptible to dataset bias and dominated by frequent interaction classes. Finally, these methods learn HOI concepts from a candidate set generated by pairing up all the human and object proposals, which is highly noisy and often leads to erroneous human-object associations for many interaction classes. To address the aforementioned limitations, we introduce a new weakly-supervised HOI detection strategy that incorporates prior knowledge from pretrained foundation models to facilitate HOI learning. In particular, we propose to integrate CLIP (Radford et al., 2021b), a large-scale vision-language pretrained model. This allows us to exploit the strong generalization capability of the CLIP representation for learning a better HOI representation under weak supervision. Compared to representations learned by an object detector, the CLIP representations are inherently less object-centric and hence more likely to also capture aspects of the human-object interaction, as evidenced in Appendix A. Although a few works have successfully exploited CLIP for supervised HOI detection, we find experimentally that they do not perform well in the more challenging weakly-supervised setting (cf. Appendix B). We hypothesize this is because they only transfer knowledge at the image level and fail without supervision at the level of human-object pairs. We therefore develop a CLIP-guided HOI representation capable of incorporating prior knowledge of HOIs at two different levels.
First, at the image level, we utilize the visual and linguistic embeddings of the CLIP model to build a global HOI knowledge bank and generate image-level HOI predictions. In addition, for each human-object pair, we enrich the region-based HOI features with the HOI representations in the knowledge bank via a novel attention mechanism. Such a bi-level framework enables us to exploit the image-level supervision more effectively through the shared HOI knowledge bank, and to enhance the interaction feature learning by introducing the visual and text representations of the CLIP model. We instantiate our bi-level knowledge integration strategy as a modular deep neural network with a global and a local branch. Given the human-object proposals generated by an off-the-shelf object detector, the global branch starts with a backbone network that computes image feature maps, which are used by a subsequent HOI recognition network to predict the image-wise HOI scores. The local branch builds a knowledge transfer network to extract the human-object features and augment them with the CLIP-guided knowledge bank, followed by a pairwise classification network that computes their relatedness and interaction scores¹. The relatedness scores are used to prune incorrect human-object associations, which mitigates the issue of noisy proposals. Finally, the outputs of the two branches are fused to generate the final HOI scores. To train our HOI detection network with image-level annotations, we first initialize the backbone network and the HOI knowledge bank from the CLIP encoders, and then train the entire model in an end-to-end manner.
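The data flow of the two branches can be sketched as follows. This is a schematic NumPy illustration under assumed tensor shapes: the linear heads, the scaled dot-product attention over the knowledge bank, and the multiplicative fusion are placeholder choices, not the exact modules of our network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def knowledge_attention(pair_feats, bank):
    """Enrich region-based pair features (P, D) with a knowledge bank (K, D):
    each pair attends over the K bank entries, with a residual connection."""
    d = pair_feats.shape[-1]
    attn = softmax(pair_feats @ bank.T / np.sqrt(d), axis=-1)  # (P, K)
    return pair_feats + attn @ bank                            # (P, D)

def forward(image_feat, pair_feats, bank, w_global, w_inter, w_rel):
    # Global branch: image-level HOI scores from backbone features.
    global_scores = sigmoid(image_feat @ w_global)             # (C,)
    # Local branch: knowledge-enriched pair features.
    enriched = knowledge_attention(pair_feats, bank)           # (P, D)
    inter_scores = sigmoid(enriched @ w_inter)                 # (P, C)
    relatedness = sigmoid(enriched @ w_rel).squeeze(-1)        # (P,)
    # Down-weight weakly related pairs, aggregate, then fuse branches.
    local_scores = (inter_scores * relatedness[:, None]).max(axis=0)
    return global_scores * local_scores                        # final HOI scores

# Toy shapes: D=16 features, 5 pairs, 10 bank entries, C=6 HOI classes.
rng = np.random.default_rng(1)
scores = forward(
    rng.normal(size=16),        # pooled image feature
    rng.normal(size=(5, 16)),   # human-object pair features
    rng.normal(size=(10, 16)),  # CLIP-initialized knowledge bank
    rng.normal(size=(16, 6)),   # global HOI classifier
    rng.normal(size=(16, 6)),   # pairwise interaction classifier
    rng.normal(size=(16, 1)),   # relatedness head
)
```

The key point the sketch conveys is that the same knowledge bank serves both levels: it generates image-level predictions in the global branch and enriches per-pair features in the local branch.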
In particular, we devise a novel multi-task weak supervision loss consisting of three terms: 1) an image-level HOI classification loss for the global branch; 2) an MIL-like loss for the interaction scores predicted by the local branch, defined on the aggregate of all human-object pair predictions; 3) a self-taught classification loss for the relatedness of each human-object pair, which uses the interaction scores from the model itself as supervision. We validate our method on two public benchmarks: HICO-DET (Chao et al., 2018) and V-COCO (Gupta & Malik, 2015). The empirical results and ablation studies show that our method consistently achieves state-of-the-art performance on both benchmarks. In summary, our contributions are three-fold: (i) we exploit CLIP knowledge to build a prior-enriched HOI representation, which is more robust for detecting fine-grained interaction types and under imbalanced data distributions; (ii) we develop a self-taught relatedness classification loss to alleviate the problem of mis-association between human-object pairs; (iii) our approach achieves state-of-the-art performance on the weakly-supervised HOI detection task on both benchmarks.
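The three loss terms above can be sketched as one function. This is a simplified NumPy illustration: the equal weighting of the terms, the max-pooling aggregator in the MIL term, and the thresholded pseudo-labels in the self-taught term are illustrative assumptions rather than the exact formulation.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy between probabilities p and binary targets y."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean()

def weak_supervision_loss(global_scores, pair_scores, relatedness,
                          image_labels, tau=0.5):
    """Three-term loss using only image-level labels (a sketch).

    global_scores: (C,) image-level HOI scores from the global branch
    pair_scores:   (P, C) per-pair interaction scores from the local branch
    relatedness:   (P,) per-pair relatedness scores
    image_labels:  (C,) binary image-level HOI labels
    """
    # 1) Image-level HOI classification loss for the global branch.
    l_global = bce(global_scores, image_labels)
    # 2) MIL-like loss on the aggregate of all pair predictions.
    l_mil = bce(pair_scores.max(axis=0), image_labels)
    # 3) Self-taught relatedness loss: a pair whose own interaction scores
    #    support some labelled class becomes a positive pseudo-target
    #    (in a real framework these targets would be detached from gradients).
    support = (pair_scores * image_labels).max(axis=1)  # (P,)
    pseudo = (support > tau).astype(float)
    l_rel = bce(relatedness, pseudo)
    return l_global + l_mil + l_rel

# Toy run: 5 pairs, 6 HOI classes.
rng = np.random.default_rng(2)
loss = weak_supervision_loss(
    rng.uniform(size=6), rng.uniform(size=(5, 6)),
    rng.uniform(size=5), np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0]))
```

Note that only `image_labels` is ground truth; the relatedness targets in term 3 are produced by the model itself, which is what makes the association pruning self-taught.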

2. RELATED WORKS

HOI detection: Most works on supervised HOI detection can be categorized into two groups: two-stage and one-stage HOI detection. Two-stage methods first generate a set of human-object proposals with an external object detector and then classify their interactions. They mainly focus on exploiting additional human pose information (Wan et al., 2019; Li et al., 2020a; Gupta et al., 2019), pairwise relatedness (Li et al., 2019a; Zhou et al., 2020) or relation modeling between humans and objects (Gao et al., 2020; Zhang et al., 2021c; Ulutan et al., 2020; Zhou & Chi, 2019) to enhance the HOI representations. One-stage methods predict human and object locations and their interaction types simultaneously in an end-to-end manner, and are currently dominated by transformer-based architectures (Carion et al., 2020; Kim et al., 2022; Dong et al., 2022; Zhang et al., 2021a;b).



¹ Relatedness indicates whether a human-object pair has a relation; interaction scores are multi-label scores over the interaction space.

