CWATR: GENERATING RICHER CAPTIONS WITH OBJECT ATTRIBUTES

Abstract

Image captioning is a popular yet challenging task at the intersection of Computer Vision and Natural Language Processing. Recently, transformer-based unified Vision and Language models have further advanced the state of the art in image captioning. However, fundamental problems remain in these models. Even though the captions they generate are grammatically correct and describe the input image fairly well, they may overlook important details in the image. In this paper, we demonstrate these problems in a state-of-the-art baseline image captioning method and analyze their underlying causes. We propose a novel approach, named CWATR (Captioning With ATtRibutes), that integrates object attributes into the generated captions in order to obtain richer and more detailed descriptions. Our analyses demonstrate that the proposed approach produces richer and more visually grounded captions by successfully incorporating the attributes of objects in the scene.

1. INTRODUCTION

With the recent advancements in Computer Vision (CV) and Natural Language Processing (NLP), machines can understand and respond to visual or textual data, and establish relationships between these two modalities. Among the many tasks spanning both modalities, image captioning aims to generate grammatically and semantically meaningful sentences describing a given input image, as humans do. A good caption should be grammatically correct, natural sounding, rich, and grounded in the image (Stefanini et al., 2022; Rohrbach et al., 2018; Zhou et al., 2020b). The design of large transformer-based models (Vaswani et al., 2017) and the utilization of large datasets (Chen et al., 2015; Sharma et al., 2018; Ordonez et al., 2011; Young et al., 2014) have led to significant improvements in the state of the art in image captioning. OSCAR (Li et al., 2020) and VIVO (Hu et al., 2021) achieved strong results in general image captioning (Chen et al., 2015) and novel object captioning (Agrawal et al., 2019). VinVL (Zhang et al., 2021) further improved upon both and achieved state-of-the-art performance by utilizing richer regional features.

Even though quantitative evaluation of state-of-the-art methods yields high scores, these models have problems that are easily overlooked; they become apparent when the actual captioning outputs are examined in detail. Several studies (Yang et al., 2019; Ma et al., 2020) demonstrate that image captioning models are inclined to copy phrases from the training dataset without paying attention to the input image. Furthermore, these models may hallucinate objects that do not exist in the image or overlook important details (Yang et al., 2019; Rohrbach et al., 2018). Our observations in this study are in line with these findings. The results show that recent captioning models overlook some aspects of the scene; most of the time, the generated captions lack details about the objects it contains.
An example of such a case is shown in Figure 1. Here, the caption generated by VIVO (Visual Vocabulary Pretraining) with VinVL features (Hu et al., 2021; Zhang et al., 2021) hallucinates a chair. It also overlooks important details in the image, such as the fence of the garden and the car in the background. Moreover, it does not mention the properties of the objects in the scene, such as the "small garden" or the "red car". In this paper, we address this problem and propose a novel approach to generate richer captions with additional object attribute information. More precisely, the contributions of this paper are as follows:

