SHOW AND WRITE: ENTITY-AWARE ARTICLE GENERATION WITH IMAGE INFORMATION

Anonymous authors
Paper under double-blind review

Abstract

Prior work on article generation has primarily focused on generating articles from a human-written prompt that provides topical context and metadata about the article. However, in many applications, such as news generation, articles are also often paired with images and their captions or alt-text, which in turn are grounded in real-world events and may reference many named entities that are difficult for language models to recognize and predict correctly. To address this, we introduce ENtity-aware article Generation with Image iNformation (ENGIN), which incorporates an article's image information into language models. ENGIN generates articles conditioned on both the metadata used by prior work and image-derived information such as captions and named entities extracted from images. Our key contribution is a novel entity-aware mechanism that helps our model recognize and predict entity names in articles, improving article generation. We perform experiments on three public datasets: GoodNews, VisualNews, and WikiText. Quantitative results show that our approach improves the perplexity of generated articles by 4-5 points over the base models. Qualitative results demonstrate that text generated by ENGIN is more consistent with the images embedded in articles. We also perform article quality annotation experiments to validate that our model produces higher-quality articles. Finally, we investigate the effect ENGIN has on methods that automatically detect machine-generated articles.

1. INTRODUCTION

Automatically writing articles is a complex and challenging language generation task. A reliable article generation method enables a wide range of applications, including story generation (Fan et al., 2018; Peng et al., 2018), automated journalism (Leppänen et al., 2017; Brown et al., 2020), defending against misinformation (Zellers et al., 2020; Tan et al., 2020), and writing Wiki articles (Banerjee & Mitra, 2016; Merity et al., 2016). In early work (Lake et al., 2017; Jia & Liang, 2017; Alcorn et al., 2019), language models were trained on domain-specific data. These specialized methods worked well on in-domain data but did not generalize to out-of-distribution inputs. To address this, more recent language generators finetune large-scale pretrained language models (Radford et al., 2018; 2019; Brown et al., 2020) on domain-specific data such as news (Zellers et al., 2020) and Wikipedia (Merity et al., 2016). These methods can generate articles with unconditional sampling given the first few sentences of an article (Radford et al., 2018; 2019) or with conditional sampling given metadata such as title and author (Zellers et al., 2020; Brown et al., 2020).

Two important challenges remain unexplored in prior work on article generation. First, prior methods model only text (Brown et al., 2020; Zellers et al., 2020) (Figure 1(a)), ignoring images embedded in the articles that may provide additional insight. Second, these methods model named entities, such as organizations, places, and dates, only implicitly, even though such entities commonly provide context in long articles (Radford et al., 2019; Brown et al., 2020). Named entities are critical to accurately modeling a long article, but it is often unknown which named entities will appear at test time.
To address these challenges, we propose an ENtity-aware article Generation framework with Image iNformation (ENGIN), which leverages image information and a novel entity-aware mechanism for article generation. As Figure 1(b) shows, named entities convey important contextual information about the events underlying a news report. However, prior work (Radford et al., 2019; Brown et al., 2020; Zellers et al., 2020) models named entities together with the rest of the text, so the language model may find it difficult to distinguish entity names from the other text in articles. To solve this issue, we propose an entity-aware mechanism that helps ENGIN recognize and predict named entities. Specifically, we insert a special token after each entity name to indicate its entity category, so ENGIN models each named entity jointly with its category. An additional benefit of our entity-aware mechanism is named-entity recognition (NER) ability: our model not only recognizes and predicts entity names but also predicts the entity category simultaneously. Prior work has proposed entity-aware mechanisms for related tasks like news image captioning (e.g., Biten et al., 2019; Tran et al., 2020; Liu et al., 2020). However, these methods do not generalize to article generation, as they rely on having substantial contextual information (a full article) as well as a direct indication of which entity to generate a caption for (from the image). In contrast, when generating an article conditioned on a collection of images and captions, the model has to decide when to use each entity in the metadata. In addition, some key entities may not be present in the metadata at all. For example, in Figure 1(a) the article mentions that Angelina Jolie Pitt is an Oscar winner, but this entity does not appear in the image or caption shown in Figure 1(b).
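As a minimal sketch of this token-insertion step, the snippet below appends a category token after each entity span produced by an off-the-shelf NER tagger. The token names (e.g., <|PER|>) and the example spans are illustrative assumptions, not the paper's exact vocabulary:

```python
# Hypothetical entity-category tokens; the paper's exact special tokens
# are not specified here, so these names are illustrative.
CATEGORY_TOKENS = {"PERSON": "<|PER|>", "ORG": "<|ORG|>",
                   "GPE": "<|GPE|>", "DATE": "<|DATE|>"}

def insert_entity_tokens(text, entities):
    """Insert a category token after each named-entity span.

    `entities` is a list of (start, end, label) character spans, e.g.
    from an off-the-shelf NER tagger. Spans are processed right-to-left
    so that earlier character offsets remain valid after each insertion.
    """
    out = text
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        token = CATEGORY_TOKENS.get(label)
        if token is None:
            continue  # skip categories without a special token
        out = out[:end] + " " + token + out[end:]
    return out

article = "Angelina Jolie Pitt spoke in Los Angeles on Monday."
spans = [(0, 19, "PERSON"), (29, 40, "GPE"), (44, 50, "DATE")]
print(insert_entity_tokens(article, spans))
# Angelina Jolie Pitt <|PER|> spoke in Los Angeles <|GPE|> on Monday <|DATE|>.
```

Training on text augmented this way lets the model predict an entity name and its category as one joint sequence, which is the source of the NER ability noted above.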
In news image captioning, entities used in the caption almost always appear in the body of the article (Liu et al., 2020; Tan et al., 2020). Thus, as we will show, adapting entity-aware mechanisms from prior work (e.g., Liu et al., 2020; Dong et al., 2021) results in poor performance on our task. While we show that providing a list of named entities boosts article generation performance, doing so incurs a small manual overhead. To address this, we also show gains without any manual input. Specifically, as shown in Figure 1(b), images and captions can capture important events or key figures associated with an article. Since large vision-language models see diverse named entities during training, we use CLIP (Radford et al., 2021) to automatically select a set of likely named entities from an image (see Section 3). Figure 2 presents the overall pipeline of ENGIN. In summary, the contributions of this paper are:
• We propose an entity-aware language model, ENGIN, for article generation. Compared to existing models that use text-only context (Zellers et al., 2020; Brown et al., 2020), ENGIN effectively leverages information from images and captions to generate high-quality articles.
• We propose an entity-aware mechanism that helps language models better recognize and predict named entities by also modeling entity categories, boosting performance.
• Experiments on GoodNews (Biten et al., 2019) and VisualNews (Liu et al., 2020) show that our 1.5B-parameter ENGIN-XL improves perplexity by 2.5 points over the 6B-parameter GPT-J (Wang & Komatsuzaki, 2021). We also show that ENGIN generalizes via zero-shot transfer to WikiText (Merity et al., 2016).
• We perform a user study verifying that ENGIN produces more realistic news than prior work (Radford et al., 2019; Zellers et al., 2020). This suggests our model may help provide additional training data for learning more powerful machine-generated text detectors.
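The CLIP-based entity selection described above can be sketched as ranking a candidate list of entity names by cosine similarity between each name's text embedding and the image embedding. In the sketch below, plain NumPy vectors stand in for CLIP's image- and text-encoder outputs, and the candidate list and cutoff `k` are illustrative assumptions:

```python
import numpy as np

def select_entities(image_emb, candidates, text_embs, k=2):
    """Rank candidate entity names by cosine similarity to an image embedding.

    In a CLIP-style pipeline, `image_emb` would come from the image encoder
    and each row of `text_embs` from encoding one candidate entity name;
    here they are plain vectors so only the selection logic is shown.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img                 # cosine similarity per candidate
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return [candidates[i] for i in top]

# Toy vectors standing in for CLIP embeddings.
image_emb = np.array([1.0, 0.0, 0.0])
candidates = ["Angelina Jolie Pitt", "United Nations", "London"]
text_embs = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.9, 0.0],
                      [0.2, 0.2, 0.9]])
print(select_entities(image_emb, candidates, text_embs, k=1))
# ['Angelina Jolie Pitt']
```

The selected names can then be placed in the conditioning metadata in place of a manually supplied entity list, trading a small amount of precision for fully automatic operation.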

2. RELATED WORK

Article Generation in recent research produces text using large-scale pretrained transformer models, which can be divided into two categories: unconditional text generation (Radford et al., 2018;  



Figure 1: Prior work (Brown et al., 2020; Zellers et al., 2020), shown in (a), produces an article (black text) conditioned on article metadata (gray text), ignoring image information. This paper, shown in (b), also conditions on image information such as extracted named entities, which may provide important context (e.g., knowing that the woman in the image is an actress) when generating articles.

