ASK QUESTION WITH DOUBLE HINTS: VISUAL QUESTION GENERATION WITH ANSWER-AWARENESS AND REGION-REFERENCE

Abstract

The task of visual question generation (VQG) aims to generate human-like questions from an image and potentially other side information (e.g. answer type or the answer itself). Although promising results have been achieved, previous works on VQG either i) suffer from the one-image-to-many-questions mapping problem, which prevents the generation of referential and meaningful questions from an image, or ii) ignore rich correlations among the visual objects in an image and potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference. In particular, we aim to ask the right visual questions with Double Hints: textual answers and visual regions of interest, effectively mitigating the existing one-to-many mapping issue. To this end, we develop a simple methodology to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework that first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Our experiments on the VQA2.0 and COCO-QA datasets demonstrate that, under this new setting, our proposed model outperforms existing state-of-the-art baselines by a large margin.

1. INTRODUCTION

Visual Question Generation (VQG) is an emerging task in both the computer vision (CV) and natural language processing (NLP) fields, which aims to generate human-like questions from an image and potentially other side information (e.g. answer type or the answer itself). Recent years have seen a surge of interest in VQG because it is particularly useful for providing high-quality synthetic training data for visual question answering (VQA) (Li et al., 2018) and visual dialog systems (Jain et al., 2018). Conceptually, it is a challenging task because the generated questions are required not only to be consistent with the image content but also to be meaningful and answerable to humans. Although promising results have been achieved, previous works still encounter two major issues. First, all existing methods suffer significantly from the one-image-to-many-questions mapping problem, which prevents the generation of referential and meaningful questions from an image. Existing VQG methods can be generally categorized into three classes with respect to which hints are used for generating visual questions: 1) the whole image as the only context input (Mora et al., 2016); 2) the whole image and the desired answers (Li et al., 2018); 3) the whole image with the desired answer types (Krishna et al., 2019). Since a picture is worth a thousand words, it can potentially be mapped to many different questions, leading to the generation of diverse, non-informative questions of poor quality. Even with the answer type or desired answer information, a similar one-to-many mapping issue remains, partially because the answer hints are often very short or too broad. As a result, this side information is often not informative enough to guide the question generation process, again preventing the generation of referential and meaningful questions from an image.
The second severe issue for the existing VQG methods is that they ignore the rich correlations among the visual objects in an image and the potential interactions between the side information and the image.

