EVIDENTIAL UNCERTAINTY AND DIVERSITY GUIDED ACTIVE LEARNING FOR SCENE GRAPH GENERATION

Abstract

Scene Graph Generation (SGG) has already shown its great potential in various downstream tasks, but it comes at the price of a prohibitively expensive annotation process. To reduce the annotation cost, we propose using Active Learning (AL) to sample the most informative data. However, directly porting current AL methods to the SGG task poses the following challenges: 1) unreliable uncertainty estimates and 2) data bias problems. To deal with these challenges, we propose EDAL (Evidential Uncertainty and Diversity Guided Deep Active Learning), a novel AL framework tailored to the SGG task. For challenge 1), we start with Evidential Deep Learning (EDL) coupled with a global relationship mining approach to estimate uncertainty, which can effectively overcome the perturbations caused by open-set relationships and background-relationships and thereby yield reliable uncertainty estimates. To address challenge 2), we turn to diversity-based methods and design the Context Blocking Module and Image Blocking Module to alleviate context-level bias and image-level bias, respectively. Experiments show that our AL framework can approach the performance of a fully supervised SGG model with only about 10% of the annotation cost. Furthermore, our ablation studies indicate that introducing AL into SGG raises many challenges not observed in other vision tasks, which our new modules successfully overcome.

1. INTRODUCTION

Scene Graph Generation (SGG) (Johnson et al., 2015) aims at generating a structured representation of a scene that jointly describes objects and their attributes, as well as their pairwise relationships. SGG has attracted significant attention as it provides rich semantic relationships of the visual scenes and has great potential for improving various other vision tasks, such as object detection (Ren et al., 2015; Redmon et al., 2016), image search (Gong et al., 2012; Noh et al., 2017), and visual question answering (Antol et al., 2015; Zhu et al., 2016). Despite being an emerging area of research that can bridge the gap between computer vision and natural language processing, SGG remains underexplored, even with many recent works focusing on it (Chang et al., 2021; Zhu et al., 2022). The main challenges that impede the advancement of SGG are twofold. On the one hand, existing datasets for SGG (Krishna et al., 2017; Lu et al., 2016) suffer from serious issues, such as long-tailed distributions and noisy and missing annotations, which make it difficult to supervise a satisfactory model. On the other hand, existing deep learning-based SGG methods are data hungry, requiring large amounts of labeled samples. However, acquiring high-quality labeled data can be very costly, which is especially the case for SGG. The reason is that SGG involves labeling visual <subject, relationship, object> triplets (e.g., <people, ride, bike>) over entity and relationship classes in an image, which can be difficult and time-consuming (Yang et al., 2021; Shi et al., 2021; Guo et al., 2021). Therefore, it is highly desirable to minimize the number of labeled samples needed to train a well-performing model. Active Learning (AL) provides a solid framework to mitigate this problem (Yoo & Kweon, 2019; Kirsch et al., 2019; Huang et al., 2010; 2021).
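To make the annotation target concrete, a scene graph can be viewed as a set of <subject, relationship, object> triplets over the entities in an image. The following minimal sketch is purely illustrative (the class and field names are our assumptions, not the benchmark's actual data format); it also highlights why pairwise labeling is expensive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str    # entity class of the subject, e.g. "people"
    predicate: str  # relationship class, e.g. "ride"
    obj: str        # entity class of the object, e.g. "bike"

# A scene graph for one image is a set of such triplets.
scene_graph = {
    Triplet("people", "ride", "bike"),
    Triplet("bike", "has", "wheel"),
}

# Annotation requires inspecting object pairs and deciding, for each
# pair, whether a foreground relationship holds -- an O(n^2) effort in
# the number of detected objects, which is what makes SGG labeling
# so much costlier than per-image or per-box annotation.
```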
It is, therefore, natural to investigate whether AL can be used to save labeling costs while maintaining accuracy, which is the focus of this paper. In AL, the model selects the most informative examples from an unlabeled pool according to some criterion for manual labeling, and is then retrained and evaluated with the selected examples. This paradigm is simple and intuitive, but directly transferring existing AL methods to the SGG task faces several challenges. First, existing batch query-based AL paradigms (Gudovskiy et al., 2020; Kim et al., 2021; Mahmood et al., 2021; Sener & Savarese, 2017; Tan et al., 2021) applied to SGG face a large number of open-set relationships, i.e., relationships that appear in the unlabeled pool but are absent from the labeled data, mainly because of the severe long-tailed distribution of SGG relationships (Zellers et al., 2018; Tang et al., 2020a; b). We observe that existing uncertainty estimation approaches perform badly at classifying SGG relationships, especially open-set relationships. Inspired by Evidential Deep Learning (EDL) (Sensoy et al., 2018) and its advanced performance in open-set action recognition (Bao et al., 2021), we enhance it and incorporate it into our proposed AL framework to estimate relationship uncertainty. Second, the relationship annotations in SGG datasets are very sparse, resulting in severe foreground-background imbalance (Xu et al., 2017; Goel et al., 2022). Foreground-relationships are those within annotated triplets in the dataset, while background-relationships are those absent between object pairs. Because background-relationships are so numerous, they can perturb or even dominate the uncertainty estimation. To this end, we propose a relationship mining module, the Relationship Proposal Graph (RPG), as part of the uncertainty estimation, which filters out background-relationships to refine the uncertainty obtained by EDL.
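The evidential uncertainty underlying this choice can be sketched as follows. This is a minimal illustration of the standard Dirichlet-based formulation of Sensoy et al. (2018), not the paper's enhanced variant; the function name and the ReLU evidence head are our assumptions.

```python
import numpy as np

def edl_uncertainty(logits: np.ndarray) -> float:
    """Subjective-logic uncertainty from per-class evidence.

    Non-negative evidence e_k parameterizes a Dirichlet with
    alpha_k = e_k + 1. The uncertainty mass is u = K / S, where
    K is the number of classes and S = sum_k alpha_k is the
    Dirichlet strength. Abundant evidence drives u toward 0;
    zero evidence (as for an unseen, open-set relationship)
    yields the maximal u = 1.
    """
    evidence = np.maximum(logits, 0.0)  # e.g. a ReLU evidence head
    alpha = evidence + 1.0
    strength = alpha.sum()
    num_classes = alpha.size
    return float(num_classes / strength)
```

For instance, with three relationship classes, zero evidence gives u = 3/3 = 1, whereas strong evidence for one class, say `[20, 0, 0]`, gives u = 3/23 ≈ 0.13. Unlike a softmax, which can still assign a confident-looking probability to an open-set input, the evidence-based u directly exposes the lack of support.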
Third, although our improved EDL can generate reliable estimates of relationship uncertainty, its sampling results remain vulnerable to the problems of traditional uncertainty-based AL methods, i.e., data bias (Kim et al., 2021; Shen et al., 2017; Luo et al., 2013). More importantly, we found that uncertainty-based AL applied to SGG is biased at both the context level and the image level, where the context in SGG refers to the feature space formed by relationship triplets. To address this issue, we design the Context Blocking Module (CBM) and the Image Blocking Module (IBM), which are inspired by diversity-based AL methods. The former blocks similar contexts to avoid context-level bias, while the latter blocks redundant images to eliminate image-level bias.

Contributions. The main contributions of this work are the following: (1) We carry out a pioneering study of using AL for SGG to achieve label efficiency without significant performance loss, and propose a novel framework dubbed Evidential Uncertainty and Diversity Guided Deep Active Learning (EDAL). (2) In the proposed EDAL framework, we introduce evidential uncertainty to guide deep active learning, enabling efficient one-shot estimation of relationship uncertainty. In this process, a relationship mining module is designed to prevent background-relationships from perturbing the uncertainty estimation. To effectively mitigate the context-level and image-level bias problems induced by AL, we design two modules, CBM and IBM. (3) Extensive experimental results on the SGG benchmarks demonstrate that EDAL significantly reduces human annotation cost, approaching the performance of a fully supervised model with only about 10% of the labeling cost.
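The blocking idea behind CBM and IBM can be illustrated with a simple diversity-aware selection sketch. This is not the paper's actual modules (which operate on context-level and image-level features, respectively); the greedy procedure, cosine-similarity criterion, and threshold below are our assumptions, shown only to convey how blocking near-duplicates counteracts the bias of pure uncertainty sampling.

```python
import numpy as np

def greedy_diverse_select(features: np.ndarray,
                          uncertainties: np.ndarray,
                          budget: int,
                          sim_threshold: float = 0.9) -> list:
    """Pick high-uncertainty samples while blocking near-duplicates.

    Candidates are visited in order of decreasing uncertainty; a
    candidate is blocked if its cosine similarity to any already
    selected sample exceeds the threshold. Pure uncertainty sampling
    is recovered with sim_threshold = 1.0 (nothing is blocked).
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    order = np.argsort(-uncertainties)  # most uncertain first
    selected = []
    for i in order:
        if len(selected) == budget:
            break
        if all(normed[i] @ normed[j] < sim_threshold for j in selected):
            selected.append(int(i))
    return selected
```

With four samples whose first two feature vectors are nearly identical, pure uncertainty sampling would pick both redundant samples, whereas blocking skips the near-duplicate and picks a more diverse, lower-uncertainty sample instead.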

2. RELATED WORK

Scene Graph Generation (SGG). SGG extracts a structured representation of the scene by assigning appropriate relationships to object pairs and enables a more comprehensive understanding of the scene for intelligent agents (Johnson et al., 2015; Lu et al., 2016; Krishna et al., 2017; Liu et al., 2021; Yin et al., 2018). Supervised training of the SGG task requires a massive number of triplets within images in the form of <subject, relation, object>, involving several sub-tasks including object detection, object recognition, and relationship description, which results in an unaffordable annotation cost. To mitigate this, (Chen et al., 2019) proposed a semi-supervised method for SGG, which requires only a small amount of labeled data for each relationship and generates pseudo-labels for the remaining samples using image-agnostic features. However, these pseudo-labels tend to collapse to a few dominant relationships. (Ye & Kovashka, 2021) designed a weak supervision framework to reduce the reliance on labor-intensive annotations with the help of linguistic structures. Recently, (Yao et al., 2021) trained an SGG model in an unsupervised manner by drawing on knowledge bases extracted from the triplets of web-scale image captions. Despite showing the promise of label-efficient learning techniques in SGG, the above caption-based methods rely on large-scale external linguistic knowledge that fits the target scene. This, to some extent, limits their generalization to scenes without adequate linguistic priors. We explore an alternative approach and propose a hybrid AL framework tailored to the SGG task in order to avoid the expensive labeling cost without access to external knowledge. Active Learning (AL). AL aims to select the most informative data from the unlabeled pool for annotation to support model training. In vision tasks such as image classification and object detec-

