CLMIU: COMMONSENSE LEARNING IN MULTIMODAL IMAGE UNDERSTANDING

Abstract

The problem of automatically describing the content of an image through accurate and meaningful captions has been attracting considerable attention among computer vision researchers. Recently, Transformers have been applied to image captioning to encode cross-modal information, in conjunction with Convolutional Neural Networks, which supply image region descriptions (embeddings and object labels) as input. However, the generated captions sometimes fail to capture the intentions, relationships, and abstract concepts that rely on general or commonsense knowledge. In this work we propose a novel network design, combining the strengths of Transformer models with graph-based models conveying external (commonsense) knowledge. Our proposed architecture is a pure vision transformer-based image captioning model, with sequences of image patches used directly as input, without extracting any regional features. In particular, unlike prior work, our architecture incorporates a knowledge-augmented encoder with a Transformer backbone to inject the external knowledge extracted from a knowledge graph. Furthermore, bidirectional training on a vision-language corpus of image-text pairs, using modality-specific self-supervised learning objectives, achieves promising results compared to the state-of-the-art. Our method, trained from scratch on a small dataset, achieves improvements of 3.8%, 2.7%, 3.2%, and 6.3% in BLEU@4, METEOR, ROUGE, and CIDEr scores, respectively. We also report competitive results on the NoCaps dataset, showing that the model generalizes to unseen object categories.
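As a rough illustration of the design described above (a patch-based vision Transformer whose encoder also ingests embeddings derived from a knowledge graph), consider the following minimal PyTorch sketch. It is a hedged reading of the abstract, not the authors' implementation: all module names, dimensions, and the num_kg_slots parameter are illustrative assumptions.

    # Minimal sketch (not the authors' released code) of a knowledge-augmented
    # encoder: image patches are embedded directly (no regional features) and
    # concatenated with projected knowledge-graph embeddings before a standard
    # Transformer encoder. Sizes and names are assumptions for illustration.
    import torch
    import torch.nn as nn

    class KnowledgeAugmentedEncoder(nn.Module):
        def __init__(self, img_size=224, patch_size=16, dim=768,
                     depth=12, heads=12, num_kg_slots=16):
            super().__init__()
            num_patches = (img_size // patch_size) ** 2
            # Patch embedding: the image is split into patches and projected
            # linearly, with no detector-based regional features.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                         stride=patch_size)
            self.pos_embed = nn.Parameter(
                torch.zeros(1, num_patches + num_kg_slots, dim))
            # Projection for pre-computed knowledge-graph fact embeddings.
            self.kg_proj = nn.Linear(dim, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, images, kg_embeddings):
            # images: (B, 3, H, W); kg_embeddings: (B, num_kg_slots, dim)
            x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
            k = self.kg_proj(kg_embeddings)                          # (B, S, dim)
            tokens = torch.cat([x, k], dim=1) + self.pos_embed
            # Contextualized states over both image patches and knowledge slots.
            return self.encoder(tokens)

A caption decoder (not shown) would then attend over these joint image-plus-knowledge states; the point of the sketch is only that external facts enter the encoder as extra input tokens rather than as post-hoc re-ranking.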

1. INTRODUCTION

Image captioning (IC) is an important research area of Computer Vision (CV) that addresses the problem of automatically describing the content of an image. The generated description covers the global scene, the objects contained in the image, their relationships, their attributes, and the activities they are involved in. Training multimodal models on manually annotated paired image and text corpora aims to learn cross-modal representations that capture rich image and language semantics. Factual and commonsense knowledge are essential to how humans understand the world around them and learn about it. Factual knowledge refers to the specific details or elements of a subject (e.g., "London is the capital of the United Kingdom"). Commonsense knowledge includes information about events and their effects, about physical objects and how they are perceived, and about their properties and their relations to one another McCarthy et al. (1960). A large amount of this knowledge is common to all humans, hence the term "common" in "common-sense".

Commonsense knowledge is hard to compute or learn with machine learning models. Therefore, incorporating commonsense information is at present a key problem facing machine learning research Klein & Nabi (2019); Zhou et al. (2019); Zhang et al. (2019); Wang et al. (2020); Liu et al. (2020). Even the state-of-the-art (SOTA) models in image captioning ignore this type of knowledge Li et al. (2019a;b); Lu et al. (2019); Tan & Bansal (2019); Chen et al. (2020); Desai & Johnson (2020); Li et al. (2020b); Hu et al. (2020); Zhang et al. (2021b). Though some captions might hint at learning elaborated abstract concepts, it is not obvious that even training on, e.g., 1.8B image/text pairs Wang et al. (2021) will result in models capable of acquiring commonsense knowledge. We argue that relying exclusively on pre-trained language models and the concepts they have learned cannot provide sufficient information for image captioning. Incorporating external commonsense knowledge into image captioning methods relies primarily on the intuition that human beings produce image descriptions by drawing not only on the visual content but also on what they already know about the world.
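To make "external commonsense knowledge" concrete, the snippet below shows one way such facts can be retrieved for an object label via the public ConceptNet REST API. The paper only states that a knowledge graph is used; ConceptNet, the commonsense_facts helper, and its parameters are illustrative assumptions, not the authors' pipeline.

    # Hedged example: fetch commonsense triples for a concept from ConceptNet.
    import requests

    def commonsense_facts(concept: str, limit: int = 5):
        """Return (start, relation, end) triples for `concept` from ConceptNet."""
        url = f"http://api.conceptnet.io/c/en/{concept}"
        edges = requests.get(url, params={"limit": limit}).json()["edges"]
        # Each edge encodes a fact such as ("umbrella", "UsedFor", "keeping dry").
        return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"])
                for e in edges]

    print(commonsense_facts("umbrella"))

Triples retrieved this way can then be embedded and injected into the captioning encoder as additional input, as sketched after the abstract.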

