CLMIU: COMMONSENSE LEARNING IN MULTIMODAL IMAGE UNDERSTANDING

Abstract

The problem of automatically describing the content of an image through accurate and meaningful captions has been attracting considerable attention among computer vision researchers. Recently, Transformers have been applied to image captioning to encode cross-modal information, in conjunction with Convolutional Neural Networks, which provide image region descriptions in terms of embeddings and object labels as input. However, the generated captions sometimes fail to capture the intentions, relationships, and abstract concepts that rely on general or commonsense knowledge. In this work, we propose a novel network design combining the strengths of Transformer models with graph-based models conveying external (commonsense) knowledge. Our proposed architecture is a pure vision transformer-based image captioning model, with sequences of image patches used directly as input, without extracting any regional features. In particular, unlike prior work, our architecture incorporates a knowledge-augmented encoder with a Transformer backbone to inject the external knowledge extracted from a knowledge graph. Furthermore, bidirectional training on a vision-language corpus of image-text pairs, using modality-specific self-supervised learning objectives, achieves promising results compared to the state-of-the-art. Our method has been trained from scratch on a small dataset, achieving a 3.8%, 2.7%, 3.2% and 6.3% improvement in BLEU@4, METEOR, ROUGE and CIDEr scores respectively. We also report competitive results on the NoCaps dataset, showing that the model generalizes to unseen object categories.

1. INTRODUCTION

Image captioning (IC) is an important research area of Computer Vision (CV) which addresses the problem of automatically describing the content of an image. The generated description includes the global scene, the objects contained in the image, their relationships, as well as their attributes and the activities they are involved in. Training multimodal models on manually annotated paired image and text corpora aims to learn cross-modal representations that capture rich image and language semantics. Factual and commonsense knowledge are essential to how humans understand the world around them and learn about it. Factual knowledge refers to the specific details or elements of a subject (e.g. "London is the capital of the United Kingdom"). Commonsense knowledge includes information about events and their effects, about physical objects and how they are perceived, and about their properties and their relations to one another McCarthy et al. (1960). A large amount of this knowledge is common to all humans, hence the term "common" in "common-sense".

Commonsense knowledge is hard to compute/learn by machine learning models. Therefore, incorporating commonsense information is at present a key problem facing machine learning research Klein & Nabi (2019); Zhou et al. (2019); Zhang et al. (2019); Wang et al. (2020); Liu et al. (2020). Even the state-of-the-art (SOTA) models in image captioning ignore this type of knowledge Li et al. (2019a;b); Lu et al. (2019); Tan & Bansal (2019); Chen et al. (2020); Desai & Johnson (2020); Li et al. (2020b); Hu et al. (2020); Zhang et al. (2021b). Though some captions might hint at learning elaborated abstract concepts, it is not obvious that even training on, e.g., 1.8B image/text pairs Wang et al. (2021) will result in models capable of acquiring commonsense knowledge.

We argue that using exclusively pre-trained language models and the concepts learned by them cannot provide sufficient information for image captioning. Incorporating external commonsense knowledge into image captioning methods relies primarily on the intuition that human beings produce accurate and semantically rich descriptions of scenes by exploiting this source of information. In Figure 1, it can be seen that the woman is getting ready to blow out the candles on the cake. We can infer that the occasion might be a birthday celebration just by looking at the cake, the candles and the overall composition of the scene. This is the type of knowledge that is very relevant for the task of captioning. It is typically encoded in a Knowledge Graph (KG) in the form of nodes representing objects and edges representing their relationships.

Figure 1: An example image showing a case where identifying semantic concepts not explicitly represented in the scene helps to provide a better description. Commonsense knowledge is used to relate the scene elements, namely the people, the cake and the candles, to the concept of a birthday. The comparison between the baseline caption and the one generated by our method shows the benefit of incorporating external knowledge. The relationship information is extracted from the ConceptNet Speer et al. (2017) knowledge graph.

As can be seen in the figure, the incorporation of external knowledge from a knowledge graph improves the caption by suggesting the semantic concept of 'birthday'. Furthermore, current vision and language models often leverage a frozen object detector, such as Faster R-CNN, pre-trained on labelled object detection datasets. Such approaches are limited by the granularity of these datasets and are therefore less scalable. In this paper, we introduce CLMIU, short for Commonsense Learning in Multimodal Image Understanding. CLMIU's objective is to incorporate factual and commonsense knowledge in a multimodal network by training on a small dataset with access to external information. Aiming to learn accurate, object-detector-free image descriptions beyond explicit image content, we train CLMIU to a) recover masked language and transformed vision tokens from their contextualized vector representations, and b) identify the correct alignment of textual concepts to their corresponding image regions in the image-text pairs. We show both quantitatively and qualitatively that CLMIU has a strong understanding of the factual and commonsense elements needed to accurately describe natural images. When using considerably less image-text data for training, CLMIU outperforms strong baselines like Oscar Li et al. (2020b), VinVL Zhang et al. (2021b) and ViLT Kim et al. (2021). Specifically, our method achieves a 3.8%, 2.7%, 3.2% and 6.3% improvement in BLEU@4 Papineni et al. (2002), METEOR Denkowski & Lavie (2014), ROUGE Lin (2004) and CIDEr Vedantam et al. (2015) scores respectively on the MS COCO Captions dataset Chen et al. (2015). This improvement in caption generation emerges during pre-training with an external knowledge source. The results obtained on the NoCaps Agrawal et al. (2019) dataset show that the model generalizes to unseen object categories. An analysis of CLMIU's attention patterns (supplementary material, Appendix Figure 4) shows that image regions attend to text tokens that are conceptually related.

Finally, ablation studies of CLMIU show that 1) external knowledge injection works better when added to the last layer(s) rather than from the start of the multimodal encoder, 2) existing methods improve when re-trained with the external knowledge encoder, 3) using Group Mask Model Learning (GMML) Atito et al. (2021b) for the vision encoding leads to better image representations, and 4) CLMIU's performance steadily improves with longer training. The combination of these results suggests that incorporating factual and commonsense knowledge into image captioning models is a promising path forward for future research. In summary, our main contributions are: 1. CLMIU, a performant end-to-end vision and language model that learns multimodal image representations from images and their captions by incorporating external knowledge. To the best of our knowledge, CLMIU is the first to propose the use of an external knowledge graph in a pure vision transformer-based image captioning model.
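To make the role of the knowledge graph concrete, the commonsense inference of Figure 1 (cake + candles → birthday) can be sketched as a lookup over ConceptNet-style triples. This is a minimal toy illustration: the triples, relation names, and helper functions below are our own illustrative assumptions, not actual ConceptNet data or the method's implementation.

```python
# Toy commonsense lookup in the style of a ConceptNet subgraph.
# The triples below are illustrative assumptions, not real ConceptNet edges.
TRIPLES = [
    ("cake", "UsedFor", "birthday"),
    ("candle", "AtLocation", "birthday_cake"),
    ("candle", "UsedFor", "birthday"),
    ("blow_out_candle", "HasSubevent", "make_a_wish"),
    ("birthday", "IsA", "celebration"),
]

def related_concepts(concept, triples=TRIPLES):
    """Return all concepts directly linked to `concept` in the graph."""
    out = set()
    for head, _rel, tail in triples:
        if head == concept:
            out.add(tail)
        if tail == concept:
            out.add(head)
    return out

def shared_neighbors(concepts, triples=TRIPLES):
    """Concepts linked to every detected object -- candidate semantic
    concepts (e.g. 'birthday') that a captioner could inject."""
    sets = [related_concepts(c, triples) for c in concepts]
    return set.intersection(*sets) if sets else set()

# Objects detected in the scene of Figure 1:
print(shared_neighbors(["cake", "candle"]))  # -> {'birthday'}
```

In the actual model, such graph neighbourhoods would be embedded and injected through the knowledge-augmented encoder rather than queried symbolically at caption time.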


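The two self-supervised objectives described above (recovering masked tokens and verifying image-text alignment) can be sketched at the data-preparation level as follows. This is a hedged sketch of the generic recipe, not the paper's implementation: the 15% masking ratio, the [MASK] token, and the negative-sampling scheme are common choices assumed here for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.15, rng=None):
    """Randomly replace a fraction of caption tokens with [MASK].
    The model is trained to recover the originals (`targets`) from the
    surrounding multimodal context. (The 15% ratio is a common default,
    assumed here, not taken from the paper.)"""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < ratio:
            targets[i] = tok
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

def alignment_batch(pairs, rng=None):
    """Build examples for the image-text alignment objective: each image
    is paired with its own caption (label 1) and with a caption drawn
    from a different image (label 0)."""
    rng = rng or random.Random(0)
    batch = []
    for i, (img, cap) in enumerate(pairs):
        batch.append((img, cap, 1))
        j = rng.randrange(len(pairs) - 1)
        j = j if j < i else j + 1  # pick a different image's caption
        batch.append((img, pairs[j][1], 0))
    return batch
```

A vision counterpart of `mask_tokens` would operate on image-patch groups (in the spirit of GMML) rather than words, with the encoder trained to reconstruct the transformed patches.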