EVIDENTIAL UNCERTAINTY AND DIVERSITY GUIDED ACTIVE LEARNING FOR SCENE GRAPH GENERATION

Abstract

Scene Graph Generation (SGG) has already shown great potential in various downstream tasks, but it comes at the price of a prohibitively expensive annotation process. To reduce the annotation cost, we propose using Active Learning (AL) to sample the most informative data. However, directly porting current AL methods to the SGG task poses the following challenges: 1) unreliable uncertainty estimates and 2) data bias. To deal with these challenges, we propose EDAL (Evidential Uncertainty and Diversity Guided Deep Active Learning), a novel AL framework tailored to the SGG task. For challenge 1), we start with Evidential Deep Learning (EDL) coupled with a global relationship mining approach to estimate uncertainty, which effectively overcomes the perturbations caused by open-set relationships and background relationships and thus yields reliable uncertainty estimates. To address challenge 2), we adopt a diversity-based method and design a Context Blocking Module and an Image Blocking Module to alleviate context-level bias and image-level bias, respectively. Experiments show that our AL framework approaches the performance of a fully supervised SGG model at only about 10% of the annotation cost. Furthermore, our ablation studies indicate that introducing AL into SGG raises many challenges not observed in other vision tasks, which our new modules successfully overcome.

1. INTRODUCTION

Scene Graph Generation (SGG) (Johnson et al., 2015) aims at generating a structured representation of a scene that jointly describes objects and their attributes, as well as their pairwise relationships. SGG has attracted significant attention because it provides rich semantic relationships of visual scenes and has great potential for improving various other vision tasks, such as object detection (Ren et al., 2015; Redmon et al., 2016), image search (Gong et al., 2012; Noh et al., 2017), and visual question answering (Antol et al., 2015; Zhu et al., 2016). Although it is an emerging area of research that can bridge the gap between computer vision and natural language processing, SGG remains underexplored despite many recent works on the topic (Chang et al., 2021; Zhu et al., 2022). The main challenges that impede the advancement of SGG are twofold. On the one hand, existing datasets for SGG (Krishna et al., 2017; Lu et al., 2016) suffer from serious issues, such as long-tailed distributions and noisy or missing annotations, which make it difficult to train a satisfactory model. On the other hand, existing deep learning-based SGG methods are data-hungry, requiring tens or hundreds of thousands of labeled samples. However, acquiring high-quality labeled data can be very costly, and this is especially the case for SGG: it involves labeling visual <subject, relationship, object> triplets (e.g., <people, ride, bike>) over entity and relationship classes in an image, which is difficult and time-consuming (Yang et al., 2021; Shi et al., 2021; Guo et al., 2021). Therefore, it is highly desirable to minimize the number of labeled samples needed to train a well-performing model. Active Learning (AL) provides a solid framework to mitigate this problem (Yoo & Kweon, 2019; Kirsch et al., 2019; Huang et al., 2010; 2021).
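To make the annotation target concrete, a scene-graph label for an image can be viewed as a set of <subject, relationship, object> triplets over entity and relationship classes. The sketch below is purely illustrative (the class and field names are our own, not from any SGG dataset or from this paper):

```python
from dataclasses import dataclass

# Hypothetical minimal representation of one scene-graph triplet.
# Field names (subject/predicate/obj) are illustrative assumptions.
@dataclass(frozen=True)
class Triplet:
    subject: str    # entity class, e.g. "people"
    predicate: str  # relationship class, e.g. "ride"
    obj: str        # entity class, e.g. "bike"

# A scene graph annotation is then a set of such triplets; every
# ordered entity pair in the image may need to be checked by the
# annotator, which is what makes labeling so expensive.
scene_graph = {
    Triplet("people", "ride", "bike"),
    Triplet("bike", "has", "wheel"),
}
```

Because `Triplet` is a frozen dataclass it is hashable, so membership tests like `Triplet("people", "ride", "bike") in scene_graph` work directly.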
It is, therefore, natural to investigate whether AL can be used to save labeling costs while maintaining accuracy, which is the focus of this paper. In AL, the model selects the most informative examples from an unlabeled pool according to some criterion and submits them for manual labeling; the model is then retrained and evaluated on the enlarged labeled set. This procedure looks simple and intuitive, but directly transferring existing AL methods to the SGG task raises several challenges.
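The select-label-retrain cycle described above can be sketched as a generic pool-based AL loop. This is a minimal sketch of standard pool-based AL, not the paper's EDAL method; the function names, the random acquisition score (a stand-in for a real criterion such as predictive uncertainty), and the budget parameters are all illustrative assumptions:

```python
import random

def acquisition_score(model, example):
    # Placeholder criterion: random scores stand in for a real
    # informativeness measure (e.g. an uncertainty estimate).
    return random.random()

def active_learning_loop(train, labeled, unlabeled, oracle,
                         budget_per_round=8, rounds=3):
    """Generic pool-based active learning loop (a sketch; all names
    and signatures here are hypothetical, not from the paper)."""
    model = train(labeled)
    for _ in range(rounds):
        # 1) Score the unlabeled pool and pick the top-scoring examples.
        ranked = sorted(unlabeled,
                        key=lambda x: acquisition_score(model, x),
                        reverse=True)
        batch = ranked[:budget_per_round]
        # 2) Query the oracle (human annotator) for their labels.
        labeled = labeled + [(x, oracle(x)) for x in batch]
        unlabeled = [x for x in unlabeled if x not in batch]
        # 3) Retrain the model on the enlarged labeled set.
        model = train(labeled)
    return model, labeled, unlabeled
```

The design question AL methods differ on is almost entirely step 1: what `acquisition_score` should measure. The challenges discussed next concern exactly this step when the examples are scene graphs rather than single-label images.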

