SHOW AND WRITE: ENTITY-AWARE ARTICLE GENERATION WITH IMAGE INFORMATION

Anonymous authors
Paper under double-blind review

Abstract

Prior work on article generation has primarily focused on generating articles from a human-written prompt that provides topical context and metadata about the article. However, in many applications, such as news generation, articles are also paired with images and their captions or alt-text, which in turn describe real-world events and may reference many named entities that are difficult for language models to recognize and predict correctly. To address this, we introduce ENtity-aware article Generation with Image iNformation (ENGIN) to incorporate an article's image information into language models. ENGIN generates articles conditioned on the metadata used by prior work as well as image information such as captions and named entities extracted from images. Our key contribution is a novel entity-aware mechanism that helps our model recognize and predict entity names in articles, improving article generation. We perform experiments on three public datasets: GoodNews, VisualNews, and WikiText. Quantitative results show that our approach improves generated-article perplexity by 4-5 points over the base models. Qualitative results demonstrate that the text generated by ENGIN is more consistent with the images embedded in the articles. We also perform article quality annotation experiments on the generated articles to validate that our model produces higher-quality articles. Finally, we investigate the effect ENGIN has on methods that automatically detect machine-generated articles.

1. INTRODUCTION

Automatically writing articles is a complex and challenging language generation task. A reliable article generation method enables a wide range of applications, such as story generation (Fan et al., 2018; Peng et al., 2018), automated journalism (Leppänen et al., 2017; Brown et al., 2020), defending against misinformation (Zellers et al., 2020; Tan et al., 2020), and writing Wiki articles (Banerjee & Mitra, 2016; Merity et al., 2016). In early work (Lake et al., 2017; Jia & Liang, 2017; Alcorn et al., 2019), language models were trained on domain-specific data. These specialized methods worked well on in-domain data but did not generalize to out-of-distribution inputs. To address this, later language generators finetune large-scale pretrained language models (Radford et al., 2018; 2019; Brown et al., 2020) on domain-specific data such as news (Zellers et al., 2020) and Wikipedia (Merity et al., 2016). These methods can generate articles by unconditional sampling given the first few sentences of an article (Radford et al., 2018; 2019) or by conditional sampling given metadata such as title and author (Zellers et al., 2020; Brown et al., 2020).

Two important challenges remain unexplored in prior work on article generation. First, these methods model only text (Brown et al., 2020; Zellers et al., 2020) (Figure 1(a)), ignoring images embedded in the articles that may provide additional insights. Second, they only implicitly model the named entities, such as organizations, places, and dates, that commonly appear in long articles and provide context (Radford et al., 2019; Brown et al., 2020). These named entities are critical to accurately modeling a long article, yet it is often unknown which named entities will appear at test time.
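Language-model quality in this line of work is typically measured by perplexity, the exponential of the mean negative log-likelihood per token (the abstract reports a 4-5 point improvement on this metric). A minimal, generic sketch of the computation, assuming natural-log per-token probabilities from any language model (the toy values below are hypothetical, not ENGIN's outputs):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities the model assigned to each
    token in the article, e.g. log p(x_i | x_<i).
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Toy example: a model assigning uniform probability 0.25 to each of
# four tokens yields a perplexity of exactly 4.0.
logps = [math.log(0.25)] * 4
print(perplexity(logps))  # -> 4.0
```

Lower perplexity means the model assigns higher probability to the reference text, which is why a 4-5 point drop indicates a better fit to the article distribution.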
To address these challenges, we propose an ENtity-aware article Generation framework with Image iNformation (ENGIN), which leverages image information and a novel entity-aware mechanism for article generation. Named entities indicate important contextual information about the events covered by the news report in Figure 1(b). However, in prior work (Radford et al., 2019; Brown et al., 2020; Zellers et al., 2020), named entities are modeled together with the other text, and the language

