COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR OPEN-VOCABULARY VIDEO RELATION DETECTION

Abstract

Prompt tuning with large-scale pretrained vision-language models empowers open-vocabulary predictions trained on limited base categories, e.g., in object classification and detection. In this paper, we propose compositional prompt tuning with motion cues: an extended prompt tuning paradigm for compositional predictions on video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where conventional prompt tuning is easily biased toward certain subject-object combinations and motion patterns. To this end, RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of the subject-object compositions. Without bells and whistles, our RePro achieves new state-of-the-art performance on two VidVRD benchmarks, not only on the base training object and predicate categories but also on the unseen ones. Extensive ablations further demonstrate the effectiveness of the proposed compositional and multi-mode design of prompts.

1. INTRODUCTION

Video visual relation detection (VidVRD) aims to detect the visual relationships between object tracklets in videos as <subject, predicate, object> triplets (Shang et al., 2017; Chen et al., 2021; 2023; Gao et al., 2021; 2022), e.g., dog-towards-child shown in Figure 1. Compared with its counterpart in still images (Chen et al., 2019; Li et al., 2022b;c;d;e), the extra temporal axis means that a video usually contains multiple relationships at different temporal scales, and a subject-object pair can hold several predicates with ambiguous boundaries. For example, as shown in Figure 1, the action feed of child to dog co-occurs with several other predicates (e.g., away, towards). This characteristic gives VidVRD far richer and more diverse relations between objects than its image counterpart. As a result, it is impractical to collect sufficient annotations for all categories in VidVRD. Therefore, to make VidVRD practical, a model trained on limited annotations must generalize to object and predicate classes unseen in the training data.

Learning the prompt representation effectively introduces priors that describe the context of the target classes, and the constraint of this context excludes implausible classes. However, the prompt representations (either handcrafted or learned) in such approaches are monotonous and static, and learning the prompt can even break the "open" knowledge by overfitting to the base-category training data. Modeling the prompt representation for video visual relations has some specific characteristics that need to be considered:

• Compositional: The prompt context for predicates is highly related to the semantic roles of subject and object, so a holistic prompt representation might be sub-optimal for predicates. For example, as shown in Figure 1, even the same predicate (sit on) in different relation triplets (dog-sit on-floor and child-sit on-stool) has totally different visual contexts.

• Motion-related: Predicates with different motion patterns should naturally be prompted with different context tokens, but a naive prompt representation fails to consider the spatio-temporal motion cues of tracklet pairs. For example, the predicate towards shown in Figure 1 can be prompted as "a relation of [CLASS], moving closer", whereas eat and sit on can be prompted as "a relation of [CLASS], relatively static".

In this paper, we propose RePro, a compositional and motion-based Relation Prompt learning framework, as shown in Figure 2(c). To deal with the compositional characteristic of visual relations, we learn separate prompt representations for the subject and the object, which allows us to model the prompt context w.r.t. semantic roles (i.e., subject or object). For example, a possible prompt can be "sth. doing [CLASS]" for the subject and "sth. being [CLASS]" for the object. To capture the motion-related characteristic of predicate contexts, we design multi-mode prompt groups, where each group (i.e., each mode) is assigned a certain motion pattern and has its own compositional prompts for the subject and object. During implementation, we select the prompt group whose motion pattern matches that of the given tracklet pair.
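The multi-mode design above can be sketched in code. The following is a minimal NumPy sketch, not the paper's implementation: the motion-pattern names, the simple center-distance heuristic, and all class/function names are illustrative assumptions, and the "learnable" prompt tokens are stood in for by random arrays.

```python
import numpy as np

# Hypothetical coarse motion patterns; the names are illustrative, not from the paper.
MOTION_PATTERNS = ["moving_closer", "moving_apart", "relative_static"]

def center(box):
    """Center (x, y) of an [x1, y1, x2, y2] box."""
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def motion_pattern(subj_boxes, obj_boxes, eps=5.0):
    """Classify a subject-object tracklet pair's coarse motion pattern from the
    change in distance between their box centers over the tracklet's lifespan.
    A simple stand-in for the motion cues described in the text."""
    d_start = np.linalg.norm(center(subj_boxes[0]) - center(obj_boxes[0]))
    d_end = np.linalg.norm(center(subj_boxes[-1]) - center(obj_boxes[-1]))
    if d_end < d_start - eps:
        return "moving_closer"
    if d_end > d_start + eps:
        return "moving_apart"
    return "relative_static"

class MultiModePrompts:
    """One group of compositional (subject/object) prompt tokens per motion
    pattern; random arrays stand in for learnable continuous tokens."""
    def __init__(self, n_tokens=8, dim=512, seed=0):
        rng = np.random.default_rng(seed)
        self.groups = {
            p: {"subject": rng.normal(size=(n_tokens, dim)),
                "object": rng.normal(size=(n_tokens, dim))}
            for p in MOTION_PATTERNS
        }

    def select(self, subj_boxes, obj_boxes):
        """Pick the prompt group whose motion pattern matches the tracklet pair."""
        return self.groups[motion_pattern(subj_boxes, obj_boxes)]
```

For a subject tracklet that approaches a static object, `select` returns the "moving_closer" group, so the predicate is scored with prompts specialized for that motion mode.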



Figure 1: Examples of VidVRD. The relation graphs are w.r.t. the whole video clip. Dashed lines denote new categories unseen in the training data.


Figure 2: Comparisons of different prompt tuning methods for Open-VidVRD.

To this end, we propose a new task: Open-vocabulary VidVRD (Open-VidVRD). In particular, "open" means not only unseen relationship combinations, e.g., dog-sit on-floor, but also unseen objects and predicates, e.g., bread and feed, as shown in Figure 1. Recent works on such generalization focus only on the unseen combinations (Chen et al., 2021; Shang et al., 2021) in VidVRD, or on zero-shot transfer among semantically related objects in zero-shot object detection (Huang et al., 2022), e.g., the seen dog class can help to recognize the unseen wolf. However, they fail to generalize to categories totally unrelated to the limited seen ones, where the transfer gap is unbridgeable, e.g., bread in testing has no visual similarity with dog and child in training.

Thanks to the encyclopedic knowledge acquired by large vision-language models (VLMs) pretrained on big data (Radford et al., 2021; Li et al., 2022a), we can achieve open-vocabulary relation detection with training data of only limited base categories. To bridge the gap between the pretrained and downstream tasks without fine-tuning the whole VLM, a trending technique named prompt tuning is widely adopted (Liu et al., 2021; Jin et al., 2022; Zhou et al., 2022b). For example, we can achieve zero-shot object classification for the tracklet pair in Figure 2: we first crop the object tracklet regions in the video and feed them into the VLM's visual encoder to obtain the corresponding visual embeddings; then we feed a simple prompt like "a video of [CLASS]" to the VLM's text encoder to obtain the text embedding, and classify each object by the similarities between the visual and text embeddings.
Based on the tracklet classification results, for the example pair dog and child, we can craft a prompt like "a video of dog [CLASS] child", as shown in Figure 2(a), and similarly classify their predicates based on the predicate text embeddings. Furthermore, we can replace the fixed prompt tokens with learnable continuous tokens, as shown in Figure 2(b), known as prompt representation learning, which has been widely applied to open-vocabulary object detection (Gu et al., 2021; Du et al., 2022; Ma et al., 2022).
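The prompt-then-rank procedure above can be sketched as follows. This is a toy NumPy illustration, not the paper's code: `toy_text_encoder` is a deterministic stand-in for a real VLM text encoder (e.g., CLIP's), and the function and template names are hypothetical.

```python
import hashlib
import numpy as np

def toy_text_encoder(text, dim=64):
    """Deterministic stand-in for a VLM text encoder (a real system would use,
    e.g., CLIP's text tower); maps a prompt string to a fixed random vector."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim)

def cosine_sim(a, b):
    """Row-wise cosine similarity between two batches of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def classify_predicate(pair_embedding, predicate_names, text_encoder,
                       template="a video of dog {} child"):
    """Zero-shot predicate classification: fill each candidate predicate into
    the prompt template, embed the prompts with the text encoder, and rank
    them by cosine similarity against the tracklet pair's visual embedding."""
    prompts = [template.format(p) for p in predicate_names]
    text_emb = np.stack([text_encoder(p) for p in prompts])
    sims = cosine_sim(pair_embedding[None, :], text_emb)[0]
    return predicate_names[int(np.argmax(sims))], sims

# Usage: in this toy setup the pair's "visual" embedding exactly matches one
# prompt's text embedding, so that predicate is ranked first.
predicates = ["towards", "away", "feed"]
pair_emb = toy_text_encoder("a video of dog towards child")
best, sims = classify_predicate(pair_emb, predicates, toy_text_encoder)
```

Swapping the fixed template for learnable continuous tokens, as in Figure 2(b), changes only how the text embeddings are produced; the similarity-based ranking stays the same.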

