ASK QUESTION WITH DOUBLE HINTS: VISUAL QUESTION GENERATION WITH ANSWER-AWARENESS AND REGION-REFERENCE

Abstract

The task of visual question generation (VQG) aims to generate human-like questions from an image and potentially other side information (e.g., answer type or the answer itself). Although promising results have been achieved, previous works on VQG either i) suffer from the one-image-to-many-questions mapping problem, which prevents them from generating referential and meaningful questions from an image, or ii) ignore the rich correlations among the visual objects in an image and the potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm that generates visual questions with answer-awareness and region-reference. In particular, we aim to ask the right visual questions with Double Hints: textual answers and visual regions of interest, which effectively mitigate the existing one-to-many mapping issue. To this end, we develop a simple methodology to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework that first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Our experiments on the VQA2.0 and COCO-QA datasets demonstrate that, under this new setting, our proposed model outperforms existing state-of-the-art baselines by a large margin.

1. INTRODUCTION

Visual Question Generation (VQG) is an emerging task at the intersection of computer vision (CV) and natural language processing (NLP), which aims to generate human-like questions from an image and potentially other side information (e.g., answer type or the answer itself). Recent years have seen a surge of interest in VQG because it is particularly useful for providing high-quality synthetic training data for visual question answering (VQA) (Li et al., 2018) and visual dialog systems (Jain et al., 2018). Conceptually, it is a challenging task because the generated questions must not only be consistent with the image content but also be meaningful and answerable by humans. Although promising results have been achieved, previous works still encounter two major issues. First, all existing methods suffer significantly from the one-image-to-many-questions mapping problem, which prevents them from generating referential and meaningful questions from an image. Existing VQG methods can be broadly categorized into three classes according to the hints they use for generating visual questions: 1) the whole image as the only context input (Mora et al., 2016); 2) the whole image and the desired answer (Li et al., 2018); 3) the whole image with the desired answer type (Krishna et al., 2019). Since a picture is worth a thousand words, an image can potentially be mapped to many different questions, leading to the generation of diverse but non-informative questions of poor quality. Even with the answer type or the desired answer, a similar one-to-many mapping issue remains, partially because the answer hints are often very short or too broad. As a result, this side information is often not informative enough to guide the question generation process, and the model fails to generate referential and meaningful questions from an image.
The second severe issue with existing VQG methods is that they ignore the rich correlations among the visual objects in an image and the potential interactions between the side information and the image (Krishna et al., 2019). Conceptually, the implicit relations among the visual objects (e.g., spatial, semantic) can be key to generating meaningful and high-quality questions: when human annotators ask questions about a given image, they often focus on exactly these kinds of interactions. In addition, another important factor in producing informative and referential questions is how to make full use of the side information by aligning it with the targeted image. Modeling such potential interactions between the side information and an image thus becomes a critical component of generating referential and meaningful questions. To address these issues, in this paper we first propose a novel learning paradigm that generates visual questions with answer-awareness and region-reference. More specifically, we utilize referential visual regions of interest (denoted as visual hints for simplicity) and textual answers (denoted as answer hints) to faithfully guide question generation. As illustrated in Figure 1, given an image with visual hints (the region enclosed by the orange rectangle) and an answer hint (the answer), the model is able to faithfully generate the right question, whose key entities reflect the visual hints and which is answerable by the answer hint. To learn these visual hints, we develop a multi-task auto-encoder that learns the visual hints and their unique attributes automatically, without introducing any additional human annotations.
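To make the role of the two hints concrete, the self-learned visual-hint selection can be thought of as scoring each detected region against the answer representation. The bilinear scoring form, the feature dimensions, and the 0.5 threshold below are illustrative assumptions for a minimal sketch, not the paper's exact multi-task auto-encoder formulation:

```python
import numpy as np

def score_visual_hints(region_feats, answer_emb, W):
    """Score each image region's relevance to the answer hint.

    region_feats: (n_regions, d_v) object-region features
    answer_emb:   (d_a,) pooled answer embedding
    W:            (d_v, d_a) learned bilinear weight (hypothetical)
    Returns per-region hint probabilities in [0, 1].
    """
    logits = region_feats @ W @ answer_emb      # (n_regions,)
    return 1.0 / (1.0 + np.exp(-logits))        # element-wise sigmoid

rng = np.random.default_rng(0)
probs = score_visual_hints(rng.normal(size=(5, 8)),   # 5 regions, 8-dim features
                           rng.normal(size=4),        # 4-dim answer embedding
                           rng.normal(size=(8, 4)))
hints = probs > 0.5   # regions treated as visual hints
```

In the actual model these scores would be trained with auxiliary reconstruction objectives rather than supervised labels, which is what allows the hints to be learned without extra human annotation.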
Furthermore, to capture the rich interactions between the two hints and the image, as well as the sophisticated relationships among the visual objects in an image, we propose a new Double-Hints guided Graph-to-Sequence learning framework (DH-Graph2Seq). The proposed model first represents these interactions as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a Graph2Seq model to generate the questions with double hints. In addition, on the decoder side, we present a visual-hint guided separate attention mechanism that attends to the image and the object graph separately while explicitly down-weighting the non-visual-hint regions. In summary, we highlight our main contributions as follows:

• We propose a novel learning paradigm that generates visual questions with Double Hints (textual answers and visual regions of interest), which effectively mitigates the existing one-to-many mapping issue. To the best of our knowledge, this is the first time both visual hints and answer hints are used for the VQG task.

• We explicitly cast the VQG task as a Graph-to-Sequence (Graph2Seq) learning problem. We employ a graph learning technique to learn the implicit graph topology, capturing the rich interactions within an image and between the image and the hints, and then utilize a Graph2Seq model to guide question generation with double hints.

• Our extensive experiments on the VQA2.0 and COCO-QA datasets demonstrate that our proposed model outperforms existing state-of-the-art baselines by a large margin.
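As a rough illustration of the implicit-topology learning step, a dynamic graph over object nodes can be built from pairwise similarities of projected node embeddings, followed by one round of message passing over the resulting soft adjacency. The projection matrices, scaling, and single-layer update below are a minimal sketch under assumed dimensions, not the exact DH-Graph2Seq architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_graph_layer(H, Wq, Wk, Wv):
    """One layer of implicit-topology graph learning.

    H: (n, d) node (visual object) embeddings.
    A soft adjacency matrix is computed from the similarity of
    projected node pairs, then used to aggregate neighbor messages.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # learned soft adjacency, rows sum to 1
    return A, A @ V                              # adjacency and updated node states

rng = np.random.default_rng(1)
n, d = 6, 8
A, H_new = dynamic_graph_layer(rng.normal(size=(n, d)),
                               *(rng.normal(size=(d, d)) for _ in range(3)))
```

Stacking such layers and feeding the final node states to a sequence decoder yields the graph-to-sequence pattern the framework builds on; the hint information would additionally condition both the graph construction and the decoder attention.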



Figure 1: The overall framework of our proposed model with double hints to guide VQG.

