VIDEO SCENE GRAPH GENERATION FROM SINGLE-FRAME WEAK SUPERVISION

Abstract

Video scene graph generation (VidSGG) aims to generate a sequence of graph-structured representations for a given video. However, all existing VidSGG methods are fully supervised, i.e., they require dense and costly manual annotations. In this paper, we propose the first weakly-supervised VidSGG task with only single-frame weak supervision: SF-VidSGG. By "weakly supervised", we mean that SF-VidSGG relaxes the training supervision at two different levels: 1) It only provides single-frame annotations instead of all-frame annotations. 2) The single-frame ground-truth annotation is itself a weak image SGG annotation, i.e., an unlocalized scene graph. To solve this new task, we also propose a novel Pseudo Label Assignment based method, dubbed PLA. PLA is a two-stage method: it generates pseudo visual relation annotations for the given video in the first stage, and then trains a fully-supervised VidSGG model with these pseudo labels. Specifically, PLA consists of three modules: an object PLA module, a predicate PLA module, and a future predicate prediction (FPP) module. First, in the object PLA module, we localize all objects in every frame. Then, in the predicate PLA module, we design two different teachers to assign pseudo predicate labels. Lastly, in the FPP module, we fuse these two predicate pseudo labels by exploiting the regularity of relation transitions in videos. Extensive ablations and results on the Action Genome benchmark demonstrate the effectiveness of our PLA.

1. INTRODUCTION

A scene graph (Johnson et al., 2015) is a type of visually-aligned graph-structured representation that summarizes all the object instances (e.g., "person", "chair") as nodes and their pairwise visual relations (or predicates, e.g., "sitting on") as directed edges. As a bridge connecting the vision and language modalities, scene graphs have been widely used in many downstream vision-language tasks, such as visual captioning (Yang et al., 2019; 2020), grounding (Jing et al., 2020), question answering (Hudson & Manning, 2019), and retrieval (Johnson et al., 2015). Early Scene Graph Generation (SGG) work mainly focuses on generating scene graphs for a given image, dubbed ImgSGG (Xu et al., 2017; Zellers et al., 2018; Chen et al., 2019). However, due to its static nature, ImgSGG fails to represent numerous dynamic visual relations that take place over a period of time, such as "walking with" and "running away" (vs. the static relation "standing"). Moreover, it is hard or even impossible to identify these dynamic visual relations from a single frame, because they can only be classified correctly by considering the temporal context. Therefore, another more meaningful but challenging video-based SGG task was proposed: VidSGG (Shang et al., 2017; 2019). Owing to the complex and dense annotations of a scene graph (cf. Figure 1 (a)), fully-supervised SGG methods always require extensive manual annotation, and the situation is even worse for video data. Meanwhile, several prior SGG works (Li et al., 2022a) have found that even carefully manually-annotated labels are still quite noisy, i.e., the annotated positive labels may not be the best matched, and numerous negative labels are simply missing. Thus, a surge of recent ImgSGG work (Zareian et al., 2020; Zhong et al., 2021; Shi et al., 2021; Li et al., 2022c) has started to generate scene graphs for images with only weak supervision.
By "weak supervision", we mean that the annotations for model training are not complete localized scene graphs. For example, a typical type of weak supervision is unlocalized scene graphs: as illustrated in Figure 1 (b), an unlocalized scene graph contains object and predicate categories but no object bounding boxes.

Although recent weakly-supervised ImgSGG has achieved good performance and received unprecedented attention, to the best of our knowledge, there is no existing work on generating video scene graphs from weak supervision. To push forward research on this critical topic, we propose the first weakly-supervised VidSGG task with single-frame weak supervision, called SF-VidSGG. Specifically, given an input video, SF-VidSGG aims to generate a localized scene graph for each frame in the video, but the only supervision for training is an unlocalized scene graph for the middle frame of each training video. In the example shown in Figure 1 (c), the supervision is an unlocalized scene graph for the third frame. Thus, the SF-VidSGG task relieves the intensive annotation burden at two levels: 1) Video-level: For each video, we only need single-frame annotations instead of the all-frame annotations required by the fully-supervised setting (i.e., we reduce the number of annotated frames). 2) Frame-level: The single-frame annotation is an unlocalized scene graph (i.e., we avoid annotating object locations). A straightforward solution for SF-VidSGG is to first train a weakly-supervised ImgSGG model on all the weakly annotated frames, and then detect scene graphs on each frame with that ImgSGG model. Apparently, this naive ImgSGG-based method overlooks the temporal context in video data. To this end, we propose a novel Pseudo Label Assignment strategy, PLA, which can serve as the first baseline for SF-VidSGG. Since PLA is agnostic to different VidSGG architectures, it can be easily incorporated into any advanced VidSGG model.
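To make the two levels of weak supervision concrete, the annotation formats can be sketched as simple Python structures. All names here are illustrative stand-ins, not part of the paper's actual data format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LocalizedTriplet:
    # Fully-supervised annotation: categories plus bounding boxes.
    subject: str                                     # e.g., "person"
    predicate: str                                   # e.g., "sitting on"
    obj: str                                         # e.g., "chair"
    subject_box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)
    object_box: Tuple[float, float, float, float]


@dataclass
class UnlocalizedTriplet:
    # Weak annotation: categories only, no bounding boxes.
    subject: str
    predicate: str
    obj: str


def sf_vidsgg_supervision(num_frames: int,
                          middle_graph: List[UnlocalizedTriplet]
                          ) -> List[Optional[List[UnlocalizedTriplet]]]:
    """Per-frame labels for SF-VidSGG: every frame is unlabeled (None)
    except the middle frame, which carries one unlocalized scene graph."""
    labels: List[Optional[List[UnlocalizedTriplet]]] = [None] * num_frames
    labels[num_frames // 2] = middle_graph
    return labels


labels = sf_vidsgg_supervision(
    5, [UnlocalizedTriplet("person", "sitting on", "chair")])
assert labels[2] is not None
assert sum(frame is not None for frame in labels) == 1
```

Under full supervision, every entry of `labels` would instead hold a list of `LocalizedTriplet`s, which is exactly the annotation burden SF-VidSGG removes.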
Specifically, PLA decouples the problem into two steps: the first step assigns a pseudo localized scene graph to every frame in the video, and the second step trains a fully-supervised VidSGG model with these pseudo localized scene graphs. PLA consists of three modules: an object pseudo label assignment module (Obj-PLA), a relation pseudo label assignment module (Rel-PLA), and a future predicate prediction module (FPP). In the Obj-PLA module, we detect object region proposals for all frames. In the Rel-PLA module, we propose two relation pseudo label assignment teachers, which generate two different pseudo labels for each frame. In the FPP module, we determine adaptive weights to fuse the labels from the two teachers. To obtain effective adaptive weights for fusing the two teachers' knowledge, the FPP module exploits temporal context based on relation transitions in videos, i.e., how the predicate between the same subject-object pair changes across frames. In summary, we make three main technical contributions in this paper:
1. We propose the first weakly-supervised VidSGG task. Compared to its fully-supervised counterpart, it mitigates the intensive annotation burden at both the video level and the frame level.
2. We propose a novel method, PLA, for SF-VidSGG. It utilizes two teachers to assign pseudo labels to unlabeled data and refines the pseudo labels from both teachers by knowledge distillation.
3. We propose a future predicate prediction module that leverages temporal dependencies in video.
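The fusion step in FPP can be sketched minimally as follows, assuming each teacher outputs a probability distribution over predicate classes for a subject-object pair, and the FPP module supplies an adaptive weight from temporal context. The function name, the scalar-weight form, and the toy inputs are all assumptions for illustration:

```python
import numpy as np


def fuse_teacher_labels(p_teacher1, p_teacher2, alpha):
    """Fuse two teachers' predicate distributions into one soft pseudo label.

    p_teacher1, p_teacher2: arrays of shape (num_predicates,), each summing to 1.
    alpha: scalar in [0, 1]; in PLA this weight would come from the FPP
           module, which exploits relation transitions across frames.
    """
    p1 = np.asarray(p_teacher1, dtype=float)
    p2 = np.asarray(p_teacher2, dtype=float)
    fused = alpha * p1 + (1.0 - alpha) * p2
    return fused / fused.sum()   # renormalize to keep a valid distribution


# Toy example: the teachers disagree on the predicate for one pair.
p1 = [0.7, 0.2, 0.1]             # teacher 1 favors predicate 0
p2 = [0.3, 0.6, 0.1]             # teacher 2 favors predicate 1
fused = fuse_teacher_labels(p1, p2, alpha=0.5)
hard_label = int(np.argmax(fused))  # hard pseudo label, if one is needed
```

A soft fused label can be used directly as a distillation target, whereas taking the argmax yields a hard pseudo label for standard cross-entropy training.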

2. RELATED WORK

Image Scene Graph Generation (ImgSGG). ImgSGG aims to generate semantic graph structures, i.e., scene graphs, as the representation of images. In each scene graph, every node represents an object instance and every edge represents a visual relation between two objects. Early ImgSGG methods typically predict all pairwise visual relations directly (Lu et al., 2016; Zhang et al., 2017).
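The exhaustive pairwise scheme of these early methods can be sketched as below; the scoring function is a hypothetical stand-in for a learned relation classifier, not any specific model from the cited works:

```python
from itertools import permutations


def generate_scene_graph(objects, score_predicate):
    """Enumerate all ordered object pairs and pick the best predicate for each.

    objects: list of detected object labels, e.g., ["person", "chair"].
    score_predicate: callable (subject, obj) -> {predicate: score};
                     stands in for a learned relation classifier.
    """
    triplets = []
    for s, o in permutations(range(len(objects)), 2):
        scores = score_predicate(objects[s], objects[o])
        best = max(scores, key=scores.get)
        triplets.append((objects[s], best, objects[o]))
    return triplets


def toy_scorer(subj, obj):
    # Illustrative hand-written scores, not a real model.
    if subj == "person" and obj == "chair":
        return {"sitting on": 0.9, "near": 0.1}
    return {"near": 1.0}


graph = generate_scene_graph(["person", "chair"], toy_scorer)
# Two ordered pairs: (person, chair) and (chair, person).
```

The quadratic number of ordered pairs is one reason scene graph annotation is so dense, which motivates the weakly-supervised setting studied in this paper.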



Figure 1: (a) Localized scene graph: It consists of all object bboxes, object categories, and predicate categories. (b) Unlocalized scene graph: It consists of object and predicate categories without object bboxes. (c) The supervision for SF-VidSGG, which only provides an unlocalized scene graph for the middle frame.

