COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR OPEN-VOCABULARY VIDEO RELATION DETECTION

Abstract

Prompt tuning with large-scale pretrained vision-language models empowers open-vocabulary predictions trained on limited base categories, e.g., object classification and detection. In this paper, we propose compositional prompt tuning with motion cues: an extended prompt tuning paradigm for compositional predictions of video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where conventional prompt tuning is easily biased to certain subject-object combinations and motion patterns. To this end, RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of the subject-object compositions. Without bells and whistles, our RePro achieves new state-of-the-art performance on two VidVRD benchmarks, not only on the base training object and predicate categories but also on the unseen ones. Extensive ablations also demonstrate the effectiveness of the proposed compositional and multi-mode design of prompts.

1. INTRODUCTION

Video visual relation detection (VidVRD) aims to detect the visual relationships between object tracklets in videos as <subject, predicate, object> triplets (Shang et al., 2017; Chen et al., 2021; 2023; Gao et al., 2021; 2022), e.g., dog-towards-child shown in Figure 1. Compared to its counterpart in still images (Chen et al., 2019; Li et al., 2022b; c; d; e), the extra temporal axis means that a video usually contains multiple relationships with different temporal scales, and a subject-object pair can have several predicates with ambiguous boundaries. For example, as shown in Figure 1, the action feed of child to dog co-occurs with several other predicates (e.g., away, towards). As a result, the relations between objects in VidVRD are more plentiful and diverse than in its image counterpart, and it is impractical to collect sufficient annotations for all categories. Therefore, to make VidVRD practical, we need to generalize a model trained on limited annotations to new object and predicate classes unseen in the training data.

To this end, we propose a new task: Open-vocabulary VidVRD (Open-VidVRD). In particular, "open" does not only mean unseen relationship combinations, e.g., dog-sit on-floor, but also unseen objects and predicates, e.g., bread and feed, as shown in Figure 1. Recent works on such generalization only focus on the unseen combinations (Chen et al., 2021; Shang et al., 2021) in VidVRD, or zero-shot transfer among semantically related objects in zero-shot object detection (Huang et al., 2022), e.g., the seen dog class can help to recognize the unseen wolf. However, they fail to generalize to categories totally unrelated to the limited seen ones, where the transfer gap is unbridgeable, e.g., bread in testing has no visual similarity with dog and child in training.

Thanks to the encyclopedic knowledge acquired by large vision-language models (VLMs) pretrained on big data (Radford et al., 2021; Li et al., 2022a), we can achieve open-vocabulary relation detection with training data of only limited base categories. To bridge the gap between the pretraining and downstream tasks without fine-tuning the whole VLM, a trending technique named prompt tuning is widely adopted (Liu et al., 2021; Jin et al., 2022; Zhou et al., 2022b). For example, we can achieve zero-shot relation classification for the tracklet pair in Figure 2. We first crop the object tracklet regions in the video, and feed them into the visual encoder of the VLM to obtain the corresponding visual embeddings. Then we use a simple prompt like "a video of [CLASS]", feed it to the VLM's text encoder to obtain the text embedding, and classify the object based on the similarities between visual and text embeddings. Based on the tracklet classification results, for the example of the pair dog and child, we can craft a prompt like "a video of dog [CLASS] child", as shown in Figure 2 (a), and similarly classify their predicates based on the predicate text embeddings. Furthermore, we can replace the fixed prompt tokens with learnable continuous tokens, as shown in Figure 2 (b), known as prompt representation learning, which has been widely applied to open-vocabulary object detection (Gu et al., 2021; Du et al., 2022; Ma et al., 2022). Learning the prompt representation effectively introduces priors that describe the context of the target classes, and the constraint of this context excludes some impossible classes. However, the prompt representations (either handcrafted or learned) in the above approaches are monotonous and static, and learning the prompt might sometimes break the "open" knowledge due to overfitting to the base category training data.
Modeling the prompt representation for video visual relations has some specific characteristics that need to be considered:

• Compositional: The prompt context for predicates is highly related to the semantic roles of subject and object. A holistic prompt representation might be sub-optimal for predicates. For example, as shown in Figure 1, even the same predicate (sit on) in different relation triplets (dog-sit on-floor and child-sit on-stool) has a totally different visual context.
• Motion-related: Predicates with different motion patterns naturally should be prompted with different context tokens. The naive prompt representation fails to consider the spatio-temporal motion cues of tracklet pairs. For example, the predicate towards shown in Figure 1 can be prompted as "a relation of [CLASS], moving closer". In contrast, eat and sit on can be prompted as "a relation of [CLASS], relatively static".

In this paper, we propose a compositional and motion-based Relation Prompt learning framework: RePro, as shown in Figure 2 (c). To deal with the compositional characteristic of visual relations, we set compositional prompt representations specified for subject and object respectively. With this design, we can model the prompt context w.r.t. semantic roles (i.e., subject or object). For example, a possible prompt can be "sth. doing [CLASS]" for the subject and "sth. being [CLASS]" for the object. To consider the motion-related characteristic of predicate contexts, we design multi-mode prompt groups, where each group (i.e., each mode) is assigned a certain motion pattern, and has its own compositional prompts for the subject and object. In our implementation, we select the proper group according to the motion cues (patterns) in the subject-object tracklet pairs (cf. Sec. 3.3).
Compared to prompt tuning works that focus on category-based context (Zhou et al., 2022b) or instance-conditioned context (Zhou et al., 2022a; Ni et al., 2022), our motion-cue-based grouping has better cross-category generalization ability, and can avoid over-fitting to base categories. We evaluate our RePro on the VidVRD (Shang et al., 2017) and VidOR (Shang et al., 2019) benchmarks. Our experimental results show that RePro trained with only the samples of base relation categories generalizes well to novel relations, and achieves a new state-of-the-art. For example, it outperforms the top-performing method, i.e., VidVRD-II (Shang et al., 2021), by 2.54% and 3.91% absolute mAP for the SGDet and SGCls settings, respectively. Our contributions in this paper are thus three-fold. 1) A new open-vocabulary setting for the video visual relation detection task, i.e., Open-VidVRD. 2) A compositional prompt representation learning method that models the prompt contexts for the subject and object separately. 3) Motion-cue-based multi-mode prompt groups that achieve a strong generalization ability.

2. RELATED WORK

Video Visual Relation Detection (VidVRD) was defined in Shang et al. (2017; 2019) together with the proposals of the VidVRD and VidOR benchmarks. The task aims to spatio-temporally localize visual relations between object tracklets. Existing methods mainly focus on modeling better visual or spatio-temporal contexts (Qian et al., 2019; Shang et al., 2021; Cong et al., 2021), and on detecting visual relations with finer granularity either by sliding windows (Liu et al., 2020) or temporal grounding (Gao et al., 2022). They mainly work on pre-defined (closed) sets of object and predicate categories. In contrast, our work is the first to study the open-vocabulary VidVRD setting, where some object and predicate categories are unseen in the training set.

Zero-Shot Setting in Image and Video VRD. Existing VRD works, either in the image domain (Tang et al., 2020; Kan et al., 2021) or the video domain (Shang et al., 2021), only achieve zero-shot transfer on unseen triplet combinations, where the objects and predicates are seen in the training set. They ignore the model's generalization ability to unseen object/predicate categories. One concurrent work (He et al., 2022) proposes the open-vocabulary setting in image VRD. However, it puts the main emphasis on unseen object categories. In contrast, RePro generalizes the model to recognize both object and predicate categories totally unrelated to the seen training ones.

Prompt Tuning for Open-vocabulary Visual Recognition. Prompt tuning (Liu et al., 2021) has been widely adopted in both the image domain (e.g., open-vocabulary object detection (OV-Det) (Gu et al., 2021; Du et al., 2022; Ma et al., 2022)) and the video domain (e.g., zero-shot video action recognition (Lin et al., 2022; Ni et al., 2022; Ju et al., 2022; Nag et al., 2022)).
For OV-Det, recent works mainly focus on knowledge distillation from VLMs with simple handcrafted prompts (Gu et al., 2021; Ma et al., 2022), or on prompt representation learning for object regions (Du et al., 2022; Feng et al., 2022). For video action recognition, existing works mainly use fixed prompts (Nag et al., 2022), conventional learnable prompts (Ju et al., 2022), or prompts conditioned on the input video contexts (Ni et al., 2022), and they all focus on cross-frame attention or feature interaction. In contrast, our RePro learns compositional prompts by leveraging the motion cues of subject-object pairs, and has better cross-category generalization ability to detect novel visual relations in videos.

3. METHOD

[...] VLMs (Sec. 3.1). Then, we introduce the proposed Open-VidVRD method RePro, as illustrated in Figure 3, in which we first extend open-vocabulary object detection methods (Gu et al., 2021; Du et al., 2022) [...] (Li et al., 2022a). They first extract text embeddings for all categories by feeding a handcrafted prompt (e.g., "a video of [CLASS]") into the text encoder of the VLM, where [CLASS] can be replaced with the class name of an arbitrary object or predicate. The output text embedding $t_c \in \mathbb{R}^d$ for each class $c$ is
$$t_c = \mathrm{VLM}_{\mathrm{txt}}(W_c), \quad W_c = [w_1, \ldots, w_L, \tilde{w}_c], \quad \forall c \in \mathcal{C}^O_b \cup \mathcal{C}^O_n \ \text{or} \ \mathcal{C}^P_b \cup \mathcal{C}^P_n, \tag{1}$$
where $W_c$ is the prompt representation with $L$ context token vectors and the class token vector $\tilde{w}_c$. Here the superscripts $O$ and $P$ denote object and predicate categories, and the subscripts $b$ and $n$ denote the base and novel splits. Then, for each object tracklet with a cropped video region, the corresponding visual embedding can be extracted by the visual encoder of the VLM, denoted as $v_i \in \mathbb{R}^d$. Similarly, the visual embedding of a tracklet pair $(i, j)$ can also be extracted (e.g., based on the union region), denoted as $v_{i,j}$. Therefore, the $i$-th region (generally denoted as $v_i$) can be classified by the cosine similarities w.r.t. $\{t_c\}$:
$$\hat{c}_i = \arg\max_c \cos(v_i, t_c), \quad \forall c \in \mathcal{C}^O_b \cup \mathcal{C}^O_n \ \text{or} \ \mathcal{C}^P_b \cup \mathcal{C}^P_n, \tag{2}$$
where $\cos(x, y) = x^\top y / (\|x\| \|y\|)$.

Learnable Prompt. Manually tuning the words in the prompt requires domain expertise and is time-consuming or not robust (Radford et al., 2021). The alternative is to learn the prompt representations from the training data (Zhou et al., 2022b; a). Specifically, each context vector $w_i$ in $W_c$ can be set as a learnable vector while $\tilde{w}_c$ is kept fixed. In the training stage, the samples and $\{\tilde{w}_c\}$ are from base categories. In the testing stage, the $L$ learned vectors in each $W_c$ are fixed and the model performs classification in the same way as Eq. (2).
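The zero-shot classification rule of Eq. (2) can be illustrated with a minimal sketch. The embeddings below are hypothetical three-dimensional toy vectors standing in for the outputs of the VLM's text and visual encoders, and the function names `cosine` and `classify` are ours, not part of any released code.

```python
import math
from typing import Dict, Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """cos(x, y) = x^T y / (||x|| ||y||), the similarity used in Eq. (2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def classify(visual_emb: Sequence[float],
             text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the class whose text embedding is most similar to the region."""
    return max(text_embs, key=lambda c: cosine(visual_emb, text_embs[c]))


# Hypothetical text embeddings t_c, one per class name (toy values).
text_embs = {
    "dog": [1.0, 0.0, 0.0],
    "child": [0.0, 1.0, 0.0],
    "bread": [0.0, 0.0, 1.0],
}
# Hypothetical visual embedding v_i of one cropped tracklet region.
v_i = [0.2, 0.1, 0.9]
predicted = classify(v_i, text_embs)
```

Because the class set is just the keys of `text_embs`, adding a novel category at test time only requires one more text embedding, which is what makes the prediction open-vocabulary.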

3.2. OPEN-VOCABULARY OBJECT TRACKLET DETECTION

Tracklet Proposal Generation. Given a video, we first detect all the class-agnostic object tracklets using a pre-trained tracklet detector, denoted as $\mathcal{T} = \{T_i\}_{i=1}^{N}$, as shown in Figure 3 (a). Specifically, each tracklet $T_i$ is characterized by a bounding box sequence and the corresponding RoI Aligned (He et al., 2017) visual features. To reduce the computational overhead, we average the RoI features of all bounding boxes (i.e., along the temporal axis of the tracklet) following Shang et al. (2021), and denote the result as $f_i \in \mathbb{R}^{2048}$.

Tracklet Classification. Instead of directly classifying object tracklets using the VLM as in Eq. (2), we train a visual-to-language (V2L) projection module $\phi_o(\cdot)$ to further utilize the annotations of base classes. In particular, $\phi_o(\cdot)$ maps the RoI Aligned feature $f_i$ of each tracklet to the same semantic space $\mathbb{R}^d$, i.e., $v_i = \phi_o(f_i)$. Let $t^o_c$ be the text embedding of object class $c \in \mathcal{C}^O_b$. The probability of tracklet $T_i$ being classified as class $c$ is
$$p_i(c) = \frac{\exp(\cos(v_i, t^o_c)/\tau)}{\sum_{c' \in \mathcal{C}^O_b} \exp(\cos(v_i, t^o_{c'})/\tau)}, \quad \forall c \in \mathcal{C}^O_b, \tag{3}$$
where $\tau$ is the temperature parameter of the softmax.

Training Objectives. To train the object tracklet classification module, we assign base category labels to detected tracklets according to their IoU w.r.t. the ground-truth tracklets. We call tracklets with assigned labels positive tracklets, and the others negative tracklets. Note that a tracklet can be negative in two cases: 1) its content is background; and 2) its content contains a novel object category. For these negative tracklets, we follow the loss used by Du et al. (2022), which forces the prediction (from any negative tracklet) on each base class to be $1/|\mathcal{C}^O_b|$, i.e., unlike any base category.
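Therefore, the classification losses for positive and negative tracklets are
$$\mathcal{L}_{\text{cls-pos}} = -\frac{1}{|\mathcal{T}_p|} \sum_{T_i \in \mathcal{T}_p} \sum_{c \in \mathcal{C}^O_b} \mathbb{1}_{\{c = c^*_i\}} \log p_i(c), \qquad \mathcal{L}_{\text{cls-neg}} = -\frac{1}{|\mathcal{T}_n|} \sum_{T_i \in \mathcal{T}_n} \sum_{c \in \mathcal{C}^O_b} \frac{1}{|\mathcal{C}^O_b|} \log p_i(c), \tag{4}$$
where $\mathcal{T}_p$ and $\mathcal{T}_n$ are the sets of positive and negative tracklets, respectively (i.e., $\mathcal{T}_p \cup \mathcal{T}_n = \mathcal{T}$), and $c^*_i$ is the ground-truth label of the $i$-th positive tracklet. We empirically found (in Sec. 4.3) that the above negative tracklet loss works better than a loss with a unique "background" class (Zareian et al., 2021; Gu et al., 2021). Besides, following Gu et al. (2021), we distill the knowledge from a pre-trained visual encoder to $\phi_o(\cdot)$ by aligning the projected embedding $v_i = \phi_o(f_i)$ to the embedding $\hat{v}_i$ that the VLM's visual encoder produces for the same tracklet, using the $\ell_1$ loss:
$$\mathcal{L}_{\text{distill}} = \frac{1}{N} \sum_{i=1}^{N} \|v_i - \hat{v}_i\|_1. \tag{5}$$
Therefore, the overall loss for object tracklet classification is $\mathcal{L}_{\text{cls}} = \mathcal{L}_{\text{cls-pos}} + \mathcal{L}_{\text{cls-neg}} + \lambda \mathcal{L}_{\text{distill}}$, where $\lambda$ is a hyper-parameter that balances the classification and distillation terms.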
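The tracklet classification objective of Eqs. (3)-(5) can be sketched in plain Python as follows. This is an illustrative sketch under our own assumptions rather than the training code: embeddings are plain lists, all function names are hypothetical, and the default `lam=5.0` simply mirrors the value of λ reported in Sec. A.3.

```python
import math
from typing import List, Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def class_probs(v: Sequence[float], text_embs: List[Sequence[float]],
                tau: float) -> List[float]:
    """Eq. (3): temperature-scaled softmax over cosine similarities."""
    logits = [cosine(v, t) / tau for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_cls_pos(probs: List[List[float]], labels: List[int]) -> float:
    """Eq. (4), left: cross-entropy on positive tracklets."""
    return -sum(math.log(p[c]) for p, c in zip(probs, labels)) / len(probs)


def loss_cls_neg(probs: List[List[float]]) -> float:
    """Eq. (4), right: cross-entropy against the uniform target 1/|C_b|."""
    return -sum(sum(math.log(pc) for pc in p) / len(p)
                for p in probs) / len(probs)


def loss_distill(student: List[Sequence[float]],
                 teacher: List[Sequence[float]]) -> float:
    """Eq. (5): mean l1 distance between projected and VLM embeddings."""
    return sum(sum(abs(a - b) for a, b in zip(s, t))
               for s, t in zip(student, teacher)) / len(student)


def loss_cls(pos_probs, pos_labels, neg_probs, student, teacher,
             lam: float = 5.0) -> float:
    """Overall loss: L_cls-pos + L_cls-neg + lambda * L_distill."""
    return (loss_cls_pos(pos_probs, pos_labels) + loss_cls_neg(neg_probs)
            + lam * loss_distill(student, teacher))
```

The negative term is minimized exactly when a negative tracklet receives the uniform distribution over base classes, where it equals log |C_b|; any peaked prediction is penalized more.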

3.3. OPEN-VOCABULARY VISUAL RELATION CLASSIFICATION

Based on the classified object tracklets, we perform open-vocabulary relation classification for each tracklet pair, as shown in Figure 3 (b). First, we learn the prompt representations based on the pre-extracted visual embeddings of tracklet pairs, for which we introduce the compositional prompt representations and the motion-based prompt groups. Second, we utilize the pre-extracted RoI Aligned features to train a visual-to-language (V2L) projection module based on the learned prompt representations. Finally, for testing, we extract all predicate text embeddings and classify the predicates of each tracklet pair by using the RoI Aligned features and the trained V2L projection module.

Compositional Prompt Representations. The compositional prompt consists of learnable prompt representations $S_c$ and $O_c$ (of predicate class $c$) for subject and object, respectively:
$$S_c = [s_1, \ldots, s_L, \tilde{w}_c], \qquad O_c = [o_1, \ldots, o_L, \tilde{w}_c], \tag{6}$$
where $s_i$ and $o_i$ are the learnable context vectors and $\tilde{w}_c$ is the fixed class token for predicate $c$ (for $c \in \mathcal{C}^P_b$ in the training phase and for all $c$ in the testing phase). Then, the predicate text embedding $t^p_c$ is generated by concatenating the two outputs of the VLM given the two prompts (respectively) as inputs, i.e., $t^p_c = [\mathrm{VLM}_{\mathrm{txt}}(S_c), \mathrm{VLM}_{\mathrm{txt}}(O_c)]$, with $t^p_c \in \mathbb{R}^{2d}$.

Motion-based Prompt Groups. We vary the prompt contexts based on the motion cues, i.e., the relative spatio-temporal motion patterns, between each pair of subject and object. Specifically, we take the generalized IoU (GIoU) (Rezatofighi et al., 2019) as the metric to calculate the motion patterns. For each tracklet pair <$T_i$, $T_j$>, we use a vector to represent its motion pattern:
$$m_{i,j} = \mathrm{sign}([G^s_{i,j} - \gamma, \; G^e_{i,j} - \gamma, \; G^e_{i,j} - G^s_{i,j}]), \quad m_{i,j} \in \{+, -\}^3, \tag{8}$$
where $G^s_{i,j}$ and $G^e_{i,j}$ are the GIoU between subject and object for the start and end bounding boxes of their temporal intersection, respectively, and $\gamma$ is a threshold for GIoU.
This definition considers two perspectives: 1) whether the two tracklets are near or far (i.e., the first two terms of Eq. (8)), and 2) whether they move toward or away from each other (i.e., the third term of Eq. (8)). Of the $2^3 = 8$ sign combinations, two are infeasible: a pair that is far at the start and near at the end cannot have a decreasing GIoU, and vice versa. Overall, we thus have 6 motion patterns (cf. Sec. A.2 for more details) and build 6 prompt groups correspondingly. Each group consists of its own compositional prompt representations $S_c$ and $O_c$ as defined in Eq. (6). It is worth noting that we aim to build a framework for learning motion-based multi-mode prompts. The GIoU-based approach used in our framework is a simple and intuitive way to calculate motion cues. This approach is not perfect, e.g., it poorly captures the motion pattern of tracklets moving back and forth. We leave more sophisticated motion capturing approaches as future work.

Training Objectives. Based on the above definition, we train the prompt representations with visual embeddings and relative position features. For simplicity, we show the training process of a single group (in the end, we derive the final loss by averaging the losses across all groups). For each tracklet pair <$T_i$, $T_j$>, we first calculate its motion cue, select the corresponding prompt group, and extract the class text embeddings $t^p_c$ for each predicate class $c \in \mathcal{C}^P_b$. Then, we take the pre-extracted visual embeddings $v_i$ and $v_j$, and concatenate them as the pair's visual embedding $v_{i,j} = [v_i, v_j] \in \mathbb{R}^{2d}$. Following Shang et al. (2021), we additionally compute the relative position feature between the bounding boxes of $T_i$ and $T_j$, denoted as $f^s_{i,j} \in \mathbb{R}^{12}$ (cf. Sec. A.1 in the Appendix for more details). The predicted probability of predicate class $c$ for this tracklet pair is thus
$$p_{i,j}(c) = \mathrm{Sigmoid}(\cos(v_{i,j} + \phi_{\text{pos}}(f^s_{i,j}), \; t^p_c)), \quad \forall c \in \mathcal{C}^P_b, \tag{9}$$
where $\phi_{\text{pos}}$ projects $f^s_{i,j}$ to the same dimension as $v_{i,j}$.
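The motion pattern of Eq. (8) can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples with positive area, the GIoU follows the standard definition of Rezatofighi et al. (2019), ties at zero are mapped to "-" (the text does not specify this case), and all function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), positive area


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the empty fraction of the enclosing box."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    hull = ((max(a[2], b[2]) - min(a[0], b[0])) *
            (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (hull - union) / hull


def sign(x: float) -> str:
    return "+" if x > 0 else "-"  # assumption: zero counts as "-"


def motion_pattern(subj_start: Box, obj_start: Box,
                   subj_end: Box, obj_end: Box,
                   gamma: float) -> Tuple[str, str, str]:
    """Eq. (8): (near at start?, near at end?, moving closer?)."""
    g_start = giou(subj_start, obj_start)
    g_end = giou(subj_end, obj_end)
    return (sign(g_start - gamma), sign(g_end - gamma), sign(g_end - g_start))


def feasible_patterns() -> List[Tuple[str, str, str]]:
    """Far-then-near forces an increase; near-then-far forces a decrease."""
    impossible = {("-", "+", "-"), ("+", "-", "+")}
    return [p for p in product("+-", repeat=3) if p not in impossible]


# Toy pair: far apart at the start, overlapping at the end.
pattern = motion_pattern((0, 0, 10, 10), (50, 0, 60, 10),
                         (0, 0, 10, 10), (5, 0, 15, 10), gamma=-0.3)
```

The resulting tuple can serve directly as a dictionary key for selecting one of the six prompt groups.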
Going through all base classes (in $\mathcal{C}^P_b$), we obtain the probability vector of tracklet pair <$T_i$, $T_j$>, denoted as $p_{i,j}$, where each dimension is calculated by Eq. (9). At training time, we assign the predicate labels according to the IoU of each tracklet pair w.r.t. the ground-truth tracklet pair, following Shang et al. (2021). We denote the sets of positive and negative tracklet pairs as $\mathcal{P}_p$ and $\mathcal{P}_n$, respectively. Due to the multi-label setting of VidVRD, we use the binary cross-entropy (BCE) loss for relation classification. The ground truth for a positive tracklet pair in $\mathcal{P}_p$ is a binary vector of dimension $|\mathcal{C}^P_b|$, denoted as $p^*_{i,j}$. For the negative tracklet pairs in $\mathcal{P}_n$, we optimize the probability of each base class to zero, i.e., the ground truth is an all-zero vector. The classification loss is thus calculated as
$$\mathcal{L}_{\text{pred-cls}} = \frac{1}{|\mathcal{P}_p|} \sum_{(T_i, T_j) \in \mathcal{P}_p} \mathrm{BCE}(p_{i,j}, p^*_{i,j}) + \frac{1}{|\mathcal{P}_n|} \sum_{(T_i, T_j) \in \mathcal{P}_n} \mathrm{BCE}(p_{i,j}, \mathbf{0}). \tag{10}$$

Training V2L Projection Module. Once the prompt representations are learned, we train a visual-to-language (V2L) projection module on the RoI Aligned features $\{f_i\}$, so that the VLM's visual encoder is no longer needed at inference time. Given the learned prompt representations, we pre-extract the predicate class text embeddings (denoted as $\{\tilde{t}^p_c\}$) for each prompt group and fix them. Formally, for each tracklet pair <$T_i$, $T_j$>, we concatenate their RoI features as $f_{i,j} = [f_i, f_j] \in \mathbb{R}^{4096}$. Then, we use a V2L projection module $\phi_p$ to project it to the same dimension as the text embeddings. Similar to Eq. (9), the probability of predicate class $c$ is predicted as
$$p_{i,j}(c) = \mathrm{Sigmoid}(\cos(\phi_p(f_{i,j}) + \phi_{\text{pos}}(f^s_{i,j}), \; \tilde{t}^p_c)), \quad \forall c \in \mathcal{C}^P_b, \tag{11}$$
where $\phi_{\text{pos}}$ is the learned spatio-temporal projection layer and is kept fixed. Then, we apply the same loss as defined in Eq. (10), and compute the final total loss by averaging across all groups.

Discussions. Intuitively, we could train the prompt representations together with the V2L projection module $\phi_p$, and use the $\ell_1$ loss to align $\phi_p(f_{i,j})$ to $v_{i,j}$, i.e., distill the knowledge from the pre-trained visual encoder to the V2L module. We name this variant RePro†. We argue that RePro works better than RePro† for two reasons: 1) directly using the teacher (i.e., $v_{i,j}$) to train the prompt is intuitively better than using the student (i.e., the projected visual embedding), and 2) the distillation makes the V2L module focus too much on the static visual alignment, rather than the dynamic relation information learned in the prompt. In experiments, we empirically show the superiority of RePro over RePro†.
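The predicate scoring of Eq. (9) and the loss of Eq. (10) can be sketched as follows. This is a sketch under our own assumptions: the position feature is passed in already projected by φ_pos, BCE is averaged over classes (the reduction is not stated in the text), a small epsilon guards the logarithms, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predicate_text_embedding(subj_out: Sequence[float],
                             obj_out: Sequence[float]) -> List[float]:
    """t^p_c = [VLM_txt(S_c), VLM_txt(O_c)]: concatenate the two outputs."""
    return list(subj_out) + list(obj_out)


def predicate_prob(v_i: Sequence[float], v_j: Sequence[float],
                   pos_proj: Sequence[float], t_pc: Sequence[float]) -> float:
    """Eq. (9): Sigmoid(cos([v_i, v_j] + phi_pos(f^s), t^p_c))."""
    v_ij = list(v_i) + list(v_j)
    fused = [a + b for a, b in zip(v_ij, pos_proj)]
    return sigmoid(cosine(fused, t_pc))


def bce(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Binary cross-entropy averaged over classes (assumed reduction)."""
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, targets)) / len(probs)


def loss_pred_cls(pos_pairs: List[Tuple[Sequence[float], Sequence[float]]],
                  neg_probs: List[Sequence[float]]) -> float:
    """Eq. (10): positives use multi-hot targets, negatives all-zero ones."""
    pos_term = sum(bce(p, y) for p, y in pos_pairs) / len(pos_pairs)
    neg_term = sum(bce(p, [0.0] * len(p)) for p in neg_probs) / len(neg_probs)
    return pos_term + neg_term


# Toy example with d = 2 (hypothetical values).
t_pc = predicate_text_embedding([1.0, 0.0], [0.0, 1.0])
p = predicate_prob([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0], t_pc)
```

Keeping the subject and object halves of the text embedding separate means each half is compared against the embedding of the tracklet playing the same role, which is the point of the compositional design.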

4.1. DATASETS AND EVALUATION METRICS

Datasets. We evaluated our method on the VidVRD (Shang et al., 2017) and VidOR (Shang et al., 2019) benchmarks: 1) VidVRD consists of 1,000 videos, and covers 35 object categories and 132 predicate categories. We used the official splits: 800 videos for training and 200 videos for testing. 2) VidOR consists of 10,000 videos, and covers 80 object categories and 50 predicate categories. We used the official splits: 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. Since the annotations of VidOR-test are not released, we only evaluated models on the validation set.

Evaluation Settings. To build the Open-VidVRD setting, we manually split base and novel categories by selecting the common object and predicate categories as the base split, and the rare ones as the novel split. The detailed splits are given in Sec. A.8 of the Appendix. We trained the model on the triplet samples with both base object and base predicate categories in the training set. During testing, we evaluated the model in two settings: 1) Novel-split: triplet samples with all object categories and novel predicate categories, and 2) All-splits: triplet samples with all object and predicate categories, in the testing set of VidVRD (or the validation set of VidOR).

[Table 1: Performance (%) of tracklet detection on objects with novel categories. Columns: Methods, Distillation, BG-Embd, VidVRD-test (R@5, R@10), VidOR-val (R@5, R@10).]

Metrics. We follow three standard evaluation tasks in scene graph generation (Zellers et al., 2018): scene graph detection (SGDet), scene graph classification (SGCls), and predicate classification (PredCls). We apply these metrics to VidVRD: a detected triplet is considered correct if the same triplet is tagged in the ground truth, and both the subject and object tracklets have a sufficient volume IoU (e.g., 0.5) with the ground truth. Following the standard setting (Shang et al., 2017), we use mAP and Recall@K (R@K, K=50,100) as evaluation metrics.
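The volume IoU criterion mentioned above can be sketched as follows. The text does not spell out the computation, so this sketch assumes the common definition in which the volume of a tracklet is the sum of its per-frame box areas and the intersection is accumulated over the frames shared by both tracklets. The names `viou` and `box_area` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tracklet = Dict[int, Box]                # frame index -> bounding box


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def viou(track_a: Tracklet, track_b: Tracklet) -> float:
    """Volume IoU: summed per-frame intersection over summed union."""
    inter_vol = 0.0
    for frame in track_a.keys() & track_b.keys():  # temporally shared frames
        a, b = track_a[frame], track_b[frame]
        inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter_vol += inter_w * inter_h
    vol_a = sum(box_area(b) for b in track_a.values())
    vol_b = sum(box_area(b) for b in track_b.values())
    union_vol = vol_a + vol_b - inter_vol
    return inter_vol / union_vol if union_vol > 0 else 0.0


# Toy example: identical boxes, but the tracklets overlap on only 2 frames.
predicted = {f: (0.0, 0.0, 10.0, 10.0) for f in range(0, 4)}
ground_truth = {f: (0.0, 0.0, 10.0, 10.0) for f in range(2, 6)}
score = viou(predicted, ground_truth)
```

In this toy case the boxes match perfectly in space, yet the score falls below a 0.5 threshold because half of each tracklet lies outside the temporal overlap.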

4.2. IMPLEMENTATION DETAILS

Tracklet Detector & Pre-trained VLM. We used the Faster-RCNN (Ren et al., 2015)-based VinVL model (Zhang et al., 2021) to detect frame-level object bounding boxes and extract the corresponding RoI Aligned features, and then adopted Seq-NMS (Han et al., 2016) to generate class-agnostic object tracklets. The VinVL model was trained on out-of-domain image data without seeing any VidVRD data. For the pre-trained VLM, we used ALPro (Li et al., 2022a), which was pre-trained on a wide range of video-language data, and learned a fine-grained alignment between visual regions and text entities.

Relation Detection Details. Following the popular segment-based methods (Qian et al., 2019; Shang et al., 2017; 2021), we first detected visual relations in short video segments, and then adopted the greedy relation association algorithm (Shang et al., 2017) to merge the same relation triplets. The detailed hyperparameter settings are given in Sec. A.3 of the Appendix.

4.3. EVALUATE OPEN-VOCABULARY OBJECT TRACKLET DETECTION

We evaluated the tracklet detection part of RePro on novel object categories, as shown in Table 1.

Comparison to ALPro. A straightforward baseline for open-vocabulary tracklet detection is to directly apply the pre-trained VLM (ALPro in our case) by feeding the tracklet regions into its visual encoder to perform classification, as in Eq. (2). However, this has a significant computational overhead due to the heavy pipeline of ALPro's visual encoder. In contrast, our RePro requires much less computation, since we only use one projection layer (i.e., $\phi_o$). We thus compare our RePro with the above ALPro baseline. The results in row #4 show that RePro achieves comparable performance on both datasets with only the projection layer $\phi_o$.

Negative Tracklet Classification. How to model the negative samples is a key challenge widely discussed in open-vocabulary object detection works. There are usually two approaches: 1) using a unique background embedding (BG-Embd) in addition to the class text embeddings (Zareian et al., 2021; Gu et al., 2021), and 2) only using the class text embeddings, and computing the loss of negative samples as $\mathcal{L}_{\text{cls-neg}}$ in Eq. (4) (Du et al., 2022). By comparing rows #3 and #4 of Table 1, we find that training without the background embedding (i.e., with $\mathcal{L}_{\text{cls-neg}}$) achieves better recall, and outperforms the alternative by a large margin, especially on the more challenging VidOR benchmark. This is because tracklets labeled as negative may contain novel objects (rather than background), and aligning their embeddings (i.e., different novel class embeddings) to a unique background embedding hurts the model's ability to recognize novel objects.

Distillation. We verified the effectiveness of visual distillation (i.e., Eq. (5)) by comparing rows #2 and #3 of Table 1. The distillation helps RePro improve the detection recall by a large margin, especially on the more challenging VidOR benchmark. For row #1, we can observe that computing the negative tracklet classification loss as $\mathcal{L}_{\text{cls-neg}}$ without distillation yields extremely low performance. This is because forcing the classification probability of negative tracklets to be $1/|\mathcal{C}^O_b|$ (i.e., by $\mathcal{L}_{\text{cls-neg}}$) without guidance from the teacher (i.e., without distillation) leaves the model with poor generalization ability to novel categories.

[Table 2 columns: Methods, Training Data, SGDet (mAP, R@50, R@100), RelTag (P@1, P@5, P@10).]

[Table 3 columns: Split, Methods, SGDet (mAP, R@50, R@100), SGCls (mAP, R@50, R@100), PredCls (mAP, R@50, R@100).]

The relation classification part of our RePro was trained separately by keeping the results of tracklet detection fixed. All of our experiments for relation classification used the same tracklet detection results (i.e., row #4 in Table 1).

Comparison to Conventional VidVRD SOTA Methods. We compared our RePro with several SOTA methods in the conventional VidVRD setting, and show the results in Table 2. The object tracklets and features used in SOTA methods are not uniform since VidVRD is a very challenging task (see Sec. A.6 for details). We can observe that even when our RePro is trained with only base category samples (while the others are trained with both base and novel category samples), our performance on SGDet tasks is comparable to the others'. When trained with both base and novel category samples, our RePro outperforms all other SOTA methods in all SGDet tasks and most RelTag tasks.

Comparisons in the Setting of Open-VidVRD. We compared the model performances in the setting of Open-VidVRD and show the results in Table 3. Since our RePro is the first Open-VidVRD method, we compared it to ALPro (implemented as Eq. (2)). We also re-implemented the SOTA method VidVRD-II (Shang et al., 2021) and trained it on base category samples. We replaced its classifier with text embeddings extracted by ALPro's text encoder. For both ALPro and VidVRD-II, we used a fixed (handcrafted) prompt "a video of relation [CLASS]". In addition, we report the results of RePro's intuitive variant RePro†, as mentioned in the "Discussions" of Sec. 3.3. From the results in Table 3, we can observe that our RePro outperforms ALPro, VidVRD-II and RePro† by a large margin on both Novel-split and All-splits.
By comparing RePro to ALPro, we show that, unlike in tracklet classification, directly applying the pre-trained VLM to relation classification is sub-optimal and achieves poor performance. By comparing RePro to VidVRD-II, we demonstrate the superiority of our prompt tuning framework over the fixed prompt design. By comparing RePro to RePro†, we validate the effectiveness of our training scheme for RePro.

4.5. ABLATION STUDIES

We conducted careful ablation studies as shown in Table 4. Since the compositional and motion-based prompt design is one of our main contributions, we conducted ablations without either of them (rows #1, #2 and #5). To further show the effectiveness of our motion pattern design, we designed two variants, i.e., rows #3 (Ens) and #4 (Rand). Their detailed settings are enumerated as follows:

• #1: It learns a single prompt representation $W_c$ as in Eq. (1). The obtained predicate text embedding has half the dimension of $t^p_c$ in Eq. (6). So we calculated the visual embedding of a tracklet pair as $v_{i,j} = v_i - v_j$ (different from the concatenated vector $v_{i,j}$ in Eq. (9)).
• #2: Training with [...]

Compositional Prompt. By comparing the results in #1 and #2, we can observe that the compositional prompt effectively improves the performance on Novel-split. Meanwhile, the improvement on All-splits is not significant. We conjecture that the base relations, which form the majority of All-splits, require less compositional semantic context for prompt learning.

Motion-based Prompt Groups. By comparing our RePro (#5) with #2, we can observe that with the help of motion cues, our RePro achieves significant improvements in recall on all tasks, and also achieves considerable improvements in mAP on most tasks. The improvement on Novel-split is more significant than that on All-splits, e.g., 14.87%→16.52% vs. 15.28%→15.94% on R@100 of SGDet, showing that the motion-based prompt generalizes better to novel relations. Besides, comparing RePro (#5) to Ens (#3), we can see that RePro outperforms Ens on Novel-split on most tasks, and achieves considerable improvements on All-splits on all tasks. Compared to Rand (#4), RePro achieves clear improvements on most metrics for both Novel-split and All-splits.

Ablations on VidOR. We conducted the same ablation studies on VidOR-val, as shown in Table 5.
First, we can observe that the compositional prompt representation is effective on both Novel-split and All-splits, e.g., 0.86%→1.72% and 9.49%→10.06% on R@50 of SGCls. For the motion-based prompt groups, the improvement of RePro is small due to the biased data distribution (Li et al., 2021), i.e., the predicate categories strongly depend on the visual cues of the subject and object tracklets, which lets the model predict relations simply based on object appearances without considering motion cues. More results on VidOR are given in the Appendix (Sec. A.7).

5. CONCLUSIONS

In this paper, we introduced the challenging Open-VidVRD task. We analyzed two key characteristics, i.e., compositional and motion-related, when applying prompt tuning in this new task. We proposed a novel method called RePro that learns compositional prompt representations while considering motion-based contexts. Our evaluations on both conventional and open-vocabulary datasets show a clear superiority of RePro for tackling video visual relation detection tasks.

• More details about the hyperparameters are given in Sec. A.3.
• An analysis of the performance improvement in different predicate groups is given in Sec. A.4.
• Potential improvements of the motion pattern design are introduced in Sec. A.5.
• The detailed experiment settings of the compared SOTA methods are introduced in Sec. A.6.
• More experiment results on VidOR are provided in Sec. A.7.
• The detailed base/novel split information of object and predicate categories is given in Sec. A.8.

A.1 RELATIVE POSITION FEATURE FOR TRACKLET PAIRS

We compute the relative position between bounding boxes of the subject-object tracklet pair $\langle T_i, T_j \rangle$ by following Shang et al. (2021). Specifically, we compute the relative position feature between subject and object for the beginning and ending bounding boxes of their temporal intersection. For the beginning bounding boxes of the subject and object, the position feature is calculated as:

$$f^{B}_{i,j} = \left[ \frac{x_i - x_j}{x_j},\ \frac{y_i - y_j}{y_j},\ \log\frac{w_i}{w_j},\ \log\frac{h_i}{h_j},\ \log\frac{w_i h_i}{w_j h_j},\ \frac{t_i - t_j}{L_{seg}} \right],$$

where $(x_i, y_i)$ is the central coordinates of $T_i$'s beginning bounding box, $(w_i, h_i)$ is its width and height, and $t_i$ is the frame ID of this beginning bounding box. $(x_j, y_j, w_j, h_j, t_j)$ is defined similarly for $T_j$. $L_{seg}$ is the number of frames in each video segment, and following Shang et al. (2021), we set $L_{seg} = 30$. The relative position feature between the ending bounding boxes of $\langle T_i, T_j \rangle$ is defined similarly to $f^{B}_{i,j}$, and is denoted as $f^{E}_{i,j}$. The final relative position feature of $\langle T_i, T_j \rangle$ is the concatenation $f^{s}_{i,j} = [f^{B}_{i,j}, f^{E}_{i,j}]$, with $f^{s}_{i,j} \in \mathbb{R}^{12}$.
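As an illustration, the relative position feature above can be sketched in a few lines of Python. This is a minimal sketch and not the released implementation: the function names and the `(x, y, w, h, t)` tuple layout are assumptions made here for readability, and all coordinates and sizes are assumed to be strictly positive so that the divisions and logarithms are defined.

```python
import math

L_SEG = 30  # number of frames per video segment, as stated in the text


def box_position_feature(box_i, box_j, l_seg=L_SEG):
    """6-d relative position feature between two boxes.

    Each box is a hypothetical tuple (x, y, w, h, t): center coordinates,
    width, height, and frame ID.
    """
    xi, yi, wi, hi, ti = box_i
    xj, yj, wj, hj, tj = box_j
    return [
        (xi - xj) / xj,                    # relative horizontal offset
        (yi - yj) / yj,                    # relative vertical offset
        math.log(wi / wj),                 # log width ratio
        math.log(hi / hj),                 # log height ratio
        math.log((wi * hi) / (wj * hj)),   # log area ratio
        (ti - tj) / l_seg,                 # normalized temporal offset
    ]


def relative_position_feature(begin_i, begin_j, end_i, end_j):
    """Concatenate the beginning and ending features into a 12-d vector."""
    return (box_position_feature(begin_i, begin_j)
            + box_position_feature(end_i, end_j))
```

Calling `relative_position_feature` with the beginning and ending boxes of the temporal intersection of a tracklet pair yields a list of 12 floats, matching the stated dimension of the concatenated feature.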

A.2 DETAILS ABOUT THE MOTION PATTERNS

We provide a schematic of the motion patterns defined in Eq. (8), as shown in Figure 4.

A.3 HYPERPARAMETERS

We set φ o , φ p and φ pos all as two-layer MLPs with a hidden dimension of 768. The weight λ of the distillation loss (i.e., Eq. (5)) was set to 5.0. The prompt length L was set to 10. The softmax temperature τ was learnable. The GIoU threshold γ was chosen based on the statistics of the training set, so that the tracklet pairs are evenly distributed w.r.t. the different motion patterns. In our implementation, γ was set to -0.3 for VidVRD and -0.25 for VidOR. We trained RePro using Adam (Kingma & Ba, 2014) with a learning rate of 1e-4, and stopped training when the SGDet mAP dropped.
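For reference, the GIoU values that γ thresholds can be computed with the standard generalized IoU formula, GIoU = IoU - (|C| - |A ∪ B|) / |C|, where C is the smallest box enclosing both boxes. The sketch below is illustrative only: the `(x1, y1, x2, y2)` corner format and the function name are assumptions, boxes are assumed to have positive area, and the exact rule that maps GIoU values to motion patterns (Eq. (8)) is not reproduced here.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest enclosing box of the two inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (hull - union) / hull


GAMMA_VIDVRD = -0.3  # threshold reported in the text for VidVRD

# Two unit boxes separated by a one-unit gap have GIoU = -1/3,
# which falls below the VidVRD threshold.
below_threshold = giou((0, 0, 1, 1), (2, 0, 3, 1)) < GAMMA_VIDVRD
```

GIoU ranges from -1 to 1 and, unlike plain IoU, stays informative for non-overlapping boxes, which is presumably why negative thresholds such as -0.3 and -0.25 are meaningful here.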

A.4 ANALYSIS OF THE PERFORMANCE IMPROVEMENT IN DIFFERENT PREDICATE GROUPS

We evaluated Recall@100 in the PredCls setting for several predicate groups (grouped by the prefix of the predicate words) on the novel-split of the VidVRD dataset. We compared our RePro with the mean ensemble (Ens) and random select (Rand) variants of RePro (refer to Sec. 4.5). The results show that the improvements on motion-related predicates are much larger than those on context-related predicates. For example, we obtain a 9.68% absolute improvement on "run" (e.g., "run past", "run next to"), whereas for predicates that can be roughly inferred from the context (e.g., "fly", "swim"), our approach contributes little. This indicates that the performance improvements of RePro are largely attributable to motion cues.

2 are not uniform. The object tracking algorithms for tracklet generation include Seq-NMS (Han et al., 2016) and deepSORT (Wojke et al., 2017). The features include RoI Aligned features, I3D features (Carreira & Zisserman, 2017), and improved dense trajectory (iDT) features (Shang et al., 2017). Here we enumerate their details as follows:

Methods | Novel-split: SGCls (R@50, R@100), PredCls (R@50, R@100) | All-splits: SGCls (R@50, R@100), PredCls (R@50, R@100)
ALPro 3.17

We provide more experiment results for our RePro on the validation set of VidOR, as shown in Table 7. We first compare our RePro with directly using ALPro to perform relation classification as in Eq. (2). We find that ALPro performs slightly better than RePro on novel-split, because ALPro



Figure 1: Examples of VidVRD. The relation graphs are w.r.t the whole video clip. Dashed lines denote unseen new categories in the training data.

[Figure 2 elements: prompt tokens [w_1] [w_2] ... [w_L] followed by a predicate class token (e.g., [away], [towards]); subject tokens [s_1] [s_2] ... [s_L]; object tokens [o_1] [o_2] ... [o_L]; Selector.]

Figure 2: Comparisons of different prompt tuning methods for Open-VidVRD.

to open-vocabulary tracklet detection (Sec. 3.2), and then perform open-vocabulary relation classification for each tracklet pair (Sec. 3.3).

Figure 3: The overall pipeline of RePro. The visual embeddings (v i , v i,j ) are only used at training time, in which we train two V2L project modules (i.e., φ o and φ p ) to transfer the knowledge from the pre-trained VLM. At test time, only RoI features (f i , f i,j ) and position features (f s i,j ) are used. For tracklet detection (a), the knowledge is transferred by aligning v i to v i . For relation classification (b), the knowledge is transferred by the prompt representations learned under the supervision of v i,j .

3.1 PRELIMINARIES: OPEN-VOCABULARY CLASSIFICATION WITH PROMPT

Fixed Prompt. Pre-trained VLMs have a strong open-vocabulary classification ability (Li et al., 2022a). They first extract text embeddings for all categories by feeding a handcrafted prompt (e.g., "a video of [CLASS]") into the text encoder of the VLM, where [CLASS] can be replaced with the class name of an arbitrary object or predicate. Their output text embedding t c ∈ R d for each class c is

Figure 4: The schematic of the 6 motion patterns defined in Eq. (8).

• Su et al. (2020) use Seq-NMS for tracklet generation, and use the improved dense trajectory (iDT) feature and the relative motion feature of tracklet pairs for relation classification.
• Liu et al. (2020) use deepSORT for tracklet generation, and use the RoI feature, the I3D feature, and the relative motion feature of tracklet pairs for relation classification.
• Li et al. (2021) use Seq-NMS for tracklet generation, and use the RoI feature and the relative motion feature of tracklet pairs for relation classification.
• Gao et al. (2022) use deepSORT for tracklet generation, and use the RoI feature and the I3D feature for relation classification.
• Our RePro uses Seq-NMS for tracklet generation, and uses the RoI feature and the relative motion feature of tracklet pairs for relation classification.

A.7 MORE EXPERIMENT RESULTS ON VIDOR

Table 7: Performance (%) on the validation set of VidOR.

Performance (%) comparison to conventional methods on VidVRD-test. Relation Tagging (RelTag) only considers the precision of relation triplets and ignores the localization of tracklets.

Performance (%) comparison of Open-VidVRD methods on VidVRD-test.

Ablations (%) for RePro with different prompt designs on VidVRD-test, where C stands for Compositional, and M stands for Motion cues. Ens: ensemble all the learned prompts by averaging their representations. Rand: randomly select a prompt without considering motion cues.

Training with compositional prompt, but the prompt is randomly selected from the 6 groups without considering motion cues. For testing, the prompts are ensembled by averaging (Ens) or randomly selected (Rand). #5: The proposed RePro.

Recall@100 (%) of PredCls on the test set of VidVRD w.r.t. different predicate groups.

.92 12.90 18.30 37.03 35.51 37.50 15.38
Rand 37.93 51.85 16.12 18.30 44.44 36.44 50.00 15.38
RePro 44.82 55.55 25.80 18.95 40.47 41.12 50.00 12.82

A.5 POTENTIAL IMPROVEMENTS OF THE MOTION PATTERN DESIGN

The current GIoU-based motion pattern design can be further improved. Building on our motion-pattern-based prompt group learning framework, we can design more sophisticated motion capturing approaches, e.g., automatically learning motion primitives from the training set. Then, for each test sample, the motion pattern can be decomposed into a weighted combination of motion primitives. Consequently, we can use the corresponding weighted combination of the prompt representations as the desired prompt representation. We leave this as future work.

A.6 DETAILED EXPERIMENTAL SETTINGS OF THE COMPARED SOTA METHODS

Since VidVRD is a very challenging task, the object tracklets and features used in SOTA methods in Table

Acknowledgement: This work was supported by the National Key Research & Development Project of China (2021ZD0110700), the National Natural Science Foundation of China (U19B2043, 61976185), and the Fundamental Research Funds for the Central Universities (226-2022-00051). This work was also supported by A*STAR under its AME YIRG grant (Project No. A20E6c0101), and Singapore MOE Tier 2.

availability

//github.com/

ETHICS AND REPRODUCIBILITY STATEMENTS

Ethics statement. The open-vocabulary video visual relation detection (Open-VidVRD) task that we introduced in this paper is a general extension of conventional VidVRD, and there are no known extra ethical issues in terms of the Open-VidVRD task or the proposed RePro model. As for the pre-trained visual-language model (VLM) used for Open-VidVRD, the large-scale pre-training data might contain some videos and captions that involve discrimination or bias. When a pre-trained VLM is applied to Open-VidVRD, the model tends to predict relations based more on the pre-trained knowledge and to focus less on the visual cues. For example, when the pre-training data involve unethical videos and captions, a model might predict "person-punch-dog" given a video of a person caressing a dog, which implies that the person is abusing animals. To avoid these potential ethical issues, we can design algorithms to filter out such unethical training data for VLMs. For Open-VidVRD models, we can also introduce common-sense knowledge and design rule-based methods to filter out unreasonable relation triplets that involve ethical issues.

Reproducibility Statement. Our RePro is mainly implemented based on the released code of ALPro (Li et al., 2022a), VinVL (Zhang et al., 2021), and VidVRD-II (Shang et al., 2021). We first modified the code of VinVL to fit the video data and to extract object tracklets in each video. Then we modified the code of VidVRD-II to be compatible with the visual and text encoders of ALPro, and to fit the Open-VidVRD setting. We provide the detailed base/novel split information of object and predicate categories in the Appendix (Sec. A.8) to ensure all experiments can be reproduced. When training the RePro model and its variants, we manually set the random seed and fixed it for all experiments to ensure they can be reproduced. We also provide the code of our RePro model and the training/evaluation scripts in the supplementary materials.

Published as a conference paper at ICLR 2023

doesn't have the trend of fitting base categories. However, ALPro performs much worse than RePro in All-splits because it is not trained on base categories. Furthermore, we also compare RePro with the VidVRD-II (Shang et al., 2021) baseline and the variant RePro † . We can observe that RePro outperforms both VidVRD-II and RePro † by a large margin on both novel-split and all-split.

A.8 DETAILED BASE/NOVEL CATEGORIES OF OBJECT AND PREDICATE FOR VIDVRD AND VIDOR

We list the base/novel categories of object and predicate for training and evaluating our RePro and other baselines in all experiments. We also provide more statistics information in the supplementary materials.

VidVRD Object

25 base object categories: "airplane", "bicycle", "bird", "bus", "car", "dog", "domestic cat", "elephant", "hamster", "lion", "monkey", "rabbit", "sheep", "snake", "squirrel", "tiger", "train", "turtle", "whale", "zebra", "ball", "frisbee", "sofa", "skateboard", "person"

10 novel object categories: "horse", "watercraft", "giant panda", "fox", "red panda", "cattle", "motorcycle", "bear", "antelope", "lizard"

VidVRD Predicate

71 base predicate categories: "behind", "chase", "creep behind", "creep beneath", "creep front", "creep left", "creep right", "fall off", "faster", "fly above", "fly next to", "fly past", "fly toward", "fly with", "follow", "front", "jump beneath", "jump front", "jump left", "jump next to", "jump right", "jump toward", "larger", "left", "lie behind", "lie front", "lie left", "lie next to", "lie right", "move behind", "move beneath", "move front", "move left", "move right", "move with", "next to", "play", "ride", "right", "run behind", "run front", "run left", "run past", "run right", "run with", "sit above", "sit front", "sit left", "sit right", "stand behind", "stand front", "stand left", "stand next to", "stand right", "stop behind", "stop front", "stop left", "stop right", "swim front", "swim left", "swim right", "swim with", "taller", "touch", "walk behind", "walk front", "walk left", "walk next to", "walk right", "walk with", "watch"

61 novel predicate categories: "above", "away", "beneath", "bite", "creep above", "creep away", "creep next to", "creep past", "creep toward", "drive", "feed", "fight", "fly away", "fly behind", "fly front", "fly left", "fly right", "hold", "jump above", "jump away", "jump behind", "jump past", "jump with", "kick", "lie above", "lie beneath", "lie inside", "lie with", "move above", "move away", "move next to", "move past", "move toward", "past", "pull", "run above", "run away", "run beneath", "run next to", "run toward", "sit behind", "sit beneath", "sit inside", "sit next to", "stand above", "stand beneath", "stand inside", "stand with", "stop above", "stop beneath", "stop next to", "stop with", "swim behind", "swim beneath", "swim next to", "toward", "walk above", "walk away", "walk beneath", "walk past", "walk toward"

VidOR Object

50 base object categories: "adult", "child", "toy", "dog", "baby", "car", "chair", "table", "sofa", "ball/sports ball", "screen/monitor", "cup", "bicycle", "guitar", "bottle", "backpack", "handbag", "baby seat", "camera", "cat", "cellphone", "bird", "sheep/goat", "laptop", "ski", "stool", "watercraft", "duck", "bus/truck", "bench", "fruits", "baby walker", "horse", "bat", "dish",

