CWATR: GENERATING RICHER CAPTIONS WITH OBJECT ATTRIBUTES

Abstract

Image captioning is a popular yet challenging task at the intersection of Computer Vision and Natural Language Processing. Recently, transformer-based unified Vision and Language models have further advanced the state of the art in image captioning. However, these models still suffer from fundamental problems. Even though the captions they generate are grammatically correct and describe the input image fairly well, they might overlook important details in the image. In this paper, we demonstrate these problems in a state-of-the-art baseline image captioning method and analyze their underlying causes. We propose a novel approach, named CWATR (Captioning With ATtRibutes), to integrate object attributes into the generated captions in order to obtain richer and more detailed captions. Our analyses demonstrate that the proposed approach generates richer and more visually grounded captions by successfully integrating attributes of the objects in the scene into the generated captions.

1. INTRODUCTION

With the recent advancements in Computer Vision (CV) and Natural Language Processing (NLP), machines can understand and respond to visual or textual data, and establish relationships between these two modalities. Among the many studies spanning these two modalities, image captioning aims to generate grammatically and semantically meaningful sentences describing a given input image, as humans do. A good caption should be grammatically correct, natural sounding, rich, and grounded on the image (Stefanini et al., 2022; Rohrbach et al., 2018; Zhou et al., 2020b). The design of large transformer-based models (Vaswani et al., 2017) and the utilization of large datasets (Chen et al., 2015; Sharma et al., 2018; Ordonez et al., 2011; Young et al., 2014) have led to significant improvements in the state of the art in image captioning. OSCAR (Li et al., 2020) and VIVO (Hu et al., 2021) achieved strong results in general image captioning (Chen et al., 2015) and novel object captioning (Agrawal et al., 2019). VinVL (Zhang et al., 2021) further improved upon both and achieved state-of-the-art results by utilizing richer regional features.

Even though state-of-the-art methods achieve high scores under quantitative benchmark evaluation, these models have overlooked problems, which surface when their actual captioning outputs are examined in detail. Several studies (Yang et al., 2019; Ma et al., 2020) demonstrate that image captioning models are inclined to copy phrases from the training dataset without paying attention to the input image. Furthermore, these models might hallucinate objects that do not exist in the image or overlook important details (Yang et al., 2019; Rohrbach et al., 2018). Our observations in this study are in line with those findings. The results show that recent captioning models overlook some aspects of the scene; most of the time, the generated captions lack details of the objects in the scene.
An example of such a case is demonstrated in Figure 1. In this example, the caption generated by VIVO (Visual Vocabulary Pretraining) with VinVL features (Hu et al., 2021; Zhang et al., 2021), "A garden with a chair and a plant on the ground.", hallucinates a chair. It also overlooks important details in the image, such as the fence of the garden and the car in the background. Moreover, it does not mention properties of the objects in the scene, such as "small garden" or "red car". In this paper, we address this problem and propose a novel approach to generate richer captions with additional object attribute information. More precisely, the contributions of this paper are as follows:

2. RELATED WORK

Initial image captioning approaches were based on template filling (Kulkarni et al., 2011; Yao et al., 2010): predefined sentence templates were filled with predicted object names, attributes, and prepositions. Rapid progress in the deep learning field also influenced image captioning research, and deep learning based approaches were proposed (Vinyals et al., 2015; Karpathy & Fei-Fei, 2015). These approaches utilized a Convolutional Neural Network (CNN) as the image encoder and a Recurrent Neural Network (RNN) as the language model. Later, Xu et al. (2015) proposed Show, Attend, and Tell, integrating an attention mechanism between the CNN encoder and the RNN decoder in order to generate enhanced and more visually grounded captions. After Show, Attend, and Tell, the attention mechanism became standard, and many other works employed and/or improved attention in image captioning methods (Lu et al., 2017; Chen et al., 2018; 2017). Anderson et al. (2018) introduced another attention mechanism, named Bottom-Up attention, in addition to the Top-Down attention of Show, Attend, and Tell. In Bottom-Up attention, an object detection algorithm detects objects in the scene; the RNN then attends to the regional features of the detected objects instead of the whole feature map. The idea of exploiting regional features of the objects in the scene is employed in many image captioning methods (Qin et al., 2019; Ke et al., 2019; Wang et al., 2020).

The invention of transformers (Vaswani et al., 2017) caused a paradigm shift in NLP, and transformers have become the go-to method in recent approaches (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Recently, they have become quite popular in CV as well. Transformers are used as feature extractors, similar to CNNs (Dosovitskiy et al., 2021; Touvron et al., 2021). These approaches, dubbed Vision Transformers, divide the image into patches and perform self-attention on these patches.
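The bottom-up/top-down scheme described above (a detector proposes regions, and the decoder attends over their features rather than a dense feature map) can be sketched as follows. This is an illustrative NumPy sketch of additive attention, not any cited implementation; the parameters `W_v`, `W_h`, and `w_a` stand in for learned weights and are hypothetical names.

```python
import numpy as np

def top_down_attention(region_feats, hidden, W_v, W_h, w_a):
    """Additive attention over detected-region features (bottom-up),
    guided by the decoder hidden state (top-down)."""
    scores = np.tanh(region_feats @ W_v + hidden @ W_h) @ w_a  # one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over the k regions
    return weights @ region_feats     # attended context vector, shape (d,)

# Toy example: 5 regions with 8-d features, 6-d hidden state, 4-d attention space.
rng = np.random.default_rng(0)
ctx = top_down_attention(
    rng.normal(size=(5, 8)),   # regional features from an object detector
    rng.normal(size=6),        # decoder hidden state
    rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), rng.normal(size=4),
)
```

The context vector `ctx` is a convex combination of the region features, so the decoder conditions on object-level evidence at each generation step.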
Inspired by these approaches, some image captioning methods employed Vision Transformers in their Visual Encoder blocks (Liu et al., 2021; Wang et al., 2022). Another branch of image captioning approaches focuses on a unified architecture for the Visual Encoder and Language Model blocks by utilizing transformers. This kind of approach was first introduced by Zhou et al. (2020a) as Unified-VLP (Unified Vision and Language Pretraining). In Unified-VLP, a single transformer network is used for both encoding and decoding steps. It is also unified
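The patch-splitting step that these Vision Transformer encoders rely on can be sketched as below. This is an illustrative NumPy sketch under standard ViT settings (a 224x224 RGB image split into 16x16 patches); the subsequent linear projection and positional embeddings are omitted.

```python
import numpy as np

def image_to_patches(img, p):
    # img: (H, W, C) array; p: patch size.
    # Returns (num_patches, p*p*C): the flattened-patch token sequence
    # a Vision Transformer performs self-attention over.
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by patch size"
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

tokens = image_to_patches(np.zeros((224, 224, 3)), 16)
# 14 x 14 = 196 patches, each flattened to 16 * 16 * 3 = 768 values
```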



Figure 1: An example image and a poor caption generated by VIVO (Hu et al., 2021; Zhang et al., 2021).

