VIDEO SCENE GRAPH GENERATION FROM SINGLE-FRAME WEAK SUPERVISION

Abstract

Video scene graph generation (VidSGG) aims to generate a sequence of graph-structured representations for a given video. However, all existing VidSGG methods are fully-supervised, i.e., they need dense and costly manual annotations. In this paper, we propose the first weakly-supervised VidSGG task with only single-frame weak supervision: SF-VidSGG. By "weakly-supervised", we mean that SF-VidSGG relaxes the training supervision at two different levels: 1) It only provides single-frame annotations instead of all-frame annotations. 2) The single-frame ground-truth annotation is still a weak image SGG annotation, i.e., an unlocalized scene graph. To solve this new task, we also propose a novel Pseudo Label Assignment based method, dubbed PLA. PLA is a two-stage method: it generates pseudo visual relation annotations for the given video in the first stage, and then trains a fully-supervised VidSGG model with these pseudo labels. Specifically, PLA consists of three modules: an object PLA module, a predicate PLA module, and a future predicate prediction (FPP) module. First, the object PLA module localizes all objects in every frame. Then, in the predicate PLA module, we design two different teachers to assign pseudo predicate labels. Lastly, the FPP module fuses these two predicate pseudo labels by exploiting the regularity of relation transitions in videos. Extensive ablations and results on the Action Genome benchmark demonstrate the effectiveness of our PLA.
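The teacher-fusion step at the heart of the pipeline above can be sketched as follows. This is a toy illustration with made-up scores, not the paper's implementation: the names `fuse_predicates`, `scores_a`, and `scores_b` are hypothetical, and the fusion is a simple weighted average standing in for the FPP module's transition-aware fusion.

```python
# Toy sketch of fusing two teachers' predicate scores into one pseudo label.
# All names and scores are illustrative assumptions, not from the paper's code.

def fuse_predicates(scores_a, scores_b, weight=0.5):
    """Fuse two teachers' predicate-score dicts and return the top predicate."""
    predicates = set(scores_a) | set(scores_b)
    fused = {p: weight * scores_a.get(p, 0.0)
                + (1 - weight) * scores_b.get(p, 0.0)
             for p in predicates}
    # The highest fused score becomes the pseudo predicate label for this pair.
    return max(fused, key=fused.get)

# Stage 1: per frame, each "teacher" scores candidate predicates for a
# subject-object pair; fusing them yields the pseudo predicate label.
frame_scores_a = {"sitting on": 0.7, "standing on": 0.2}
frame_scores_b = {"sitting on": 0.5, "leaning on": 0.4}
pseudo_label = fuse_predicates(frame_scores_a, frame_scores_b)
print(pseudo_label)  # "sitting on"

# Stage 2 (not shown): train a fully-supervised VidSGG model on the
# pseudo-labeled frames.
```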

1. INTRODUCTION

A scene graph (Johnson et al., 2015) is a type of visually-aligned graph-structured representation that summarizes all the object instances (e.g., "person", "chair") as nodes and their pairwise visual relations (or predicates, e.g., "sitting on") as directed edges. As a bridge between the vision and language modalities, scene graphs have been widely used in many downstream vision-language tasks, such as visual captioning (Yang et al., 2019; 2020), grounding (Jing et al., 2020), question answering (Hudson & Manning, 2019), and retrieval (Johnson et al., 2015). Early Scene Graph Generation (SGG) work mainly focuses on generating scene graphs for a given image, dubbed ImgSGG (Xu et al., 2017; Zellers et al., 2018; Chen et al., 2019). However, due to its static nature, ImgSGG fails to represent numerous dynamic visual relations that take place over a period of time, such as "walking with" and "running away" (vs. the static relation "standing"). Meanwhile, it is hard or even impossible to identify these dynamic visual relations from only a single frame, because they can only be classified well by considering the temporal context. Therefore, another more meaningful but challenging video-based SGG task was proposed: VidSGG (Shang et al., 2017; 2019). Due to the complex and dense annotations of a scene graph (cf. Figure 1(a)), fully-supervised SGG methods always require extensive manual annotations, and the situation is even worse for video data. Meanwhile, several prior SGG works (Li et al., 2022a) have found that even carefully manually-annotated labels are still quite noisy, i.e., the annotated positive labels may not be the best matches, and numerous negative labels are simply missing. Thus, a surge of recent ImgSGG work (Zareian
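The nodes-and-directed-edges structure described at the start of this section can be made concrete with a minimal sketch. The class and method names below are illustrative, not from any SGG codebase: objects become nodes, and each relation is a directed edge yielding a <subject, predicate, object> triplet.

```python
# Minimal illustrative scene-graph container: object labels as nodes,
# pairwise visual relations (predicates) as directed edges.
# Names (SceneGraph, add_object, add_relation) are assumptions for this sketch.

class SceneGraph:
    def __init__(self):
        self.nodes = []   # object labels, e.g. "person"
        self.edges = []   # (subject_index, predicate, object_index)

    def add_object(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1

    def add_relation(self, subj_idx, predicate, obj_idx):
        self.edges.append((subj_idx, predicate, obj_idx))

    def triplets(self):
        # Human-readable <subject, predicate, object> triplets.
        return [(self.nodes[s], p, self.nodes[o]) for s, p, o in self.edges]

g = SceneGraph()
person = g.add_object("person")
chair = g.add_object("chair")
g.add_relation(person, "sitting on", chair)
print(g.triplets())  # [('person', 'sitting on', 'chair')]
```

A VidSGG model would output one such graph per frame, so dynamic predicates like "walking with" can change across the sequence while the node identities persist.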

