GuoFeng: A DISCOURSE-AWARE EVALUATION BENCHMARK FOR LANGUAGE UNDERSTANDING, TRANSLATION AND GENERATION

Abstract

Modeling discourse, the linguistic phenomena that go beyond individual sentences, is a fundamental and challenging problem in natural language processing (NLP). However, existing evaluation benchmarks mainly focus on intra-sentence properties and overlook important discourse phenomena that cross sentences. To bridge this gap, we propose GuoFeng, a benchmark that evaluates inter-sentence discourse properties across a diverse set of NLP tasks covering understanding, translation, and generation. GuoFeng consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also propose a diagnostic test suite that examines whether target models learn discourse knowledge. We evaluate 17 general- and in-domain models based on Transformer and advanced pretraining architectures, showing that fine-grained pretraining on document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and a leaderboard, which we hope will significantly facilitate research in this field.

1. INTRODUCTION

To evaluate the general performance of models, previous work proposed a variety of benchmarks covering different tasks and languages, such as GLUE (Wang et al., 2018), CLUE (Xu et al., 2020) and XGLUE (Liang et al., 2020). However, existing benchmarks pay little attention to discourse phenomena, which are a fundamental and challenging problem in natural language processing (NLP) (Kevitt et al., 1992). Natural language generally consists of meaningful, unified, and purposive groups of sentences, which are organized as a whole according to discourse properties (Cook, 1989). As shown in Figure 1, discourse manifests in two ways: (1) cohesion, where dependencies between words or phrases make them logically and consistently connected; (2) coherence, where structural relations between segments or sentences allow them to compose semantically and meaningfully. To bridge this gap, we introduce GuoFeng, a benchmark for the targeted evaluation of discourse modeling. GuoFeng comprises three parts:
• GuoFeng Benchmark: Nine Chinese/English discourse-aware tasks covering a broad range of NLP tasks (understanding, translation, and generation), data quantities (from 26.4K to 2.4M), and difficulties. Most of the benchmarking datasets are newly created in this work.
• GuoFeng Diagnostic Dataset: To probe the discourse information learned by models, GuoFeng also includes a dataset of 600 hand-crafted examples. Each instance is a contrastive pair, where the correct candidate is the original instance from the benchmark and the incorrect one is a perturbation created by modifying discourse devices or structures in the correct candidate.
• GuoFeng Training Data: We release large-scale, document-level data (400G) in Chinese and English, drawn from the same literature domain as the benchmark.
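The contrastive design of the diagnostic dataset can be sketched as follows: an incorrect candidate is derived from the original text by perturbing a discourse device, and a model passes an instance when it scores the correct candidate above the perturbed one. The snippet below is an illustrative sketch, not the paper's actual implementation: the pronoun-swap perturbation and the scoring function are hypothetical stand-ins (in practice the perturbations are hand-crafted and the scorer is a trained model's likelihood).

```python
# Illustrative sketch of contrastive-pair discourse probing.
# perturb_pronouns and the scorer passed to contrastive_accuracy are
# hypothetical stand-ins, not the paper's actual method.
import re

def perturb_pronouns(text: str) -> str:
    """Create an incorrect candidate by swapping a cohesive device
    (here 'she' <-> 'he'), breaking coreference cohesion."""
    table = {"she": "he", "he": "she", "She": "He", "He": "She"}
    return re.sub(r"\b(She|He|she|he)\b", lambda m: table[m.group(0)], text)

def contrastive_accuracy(pairs, score) -> float:
    """Fraction of pairs where the model scores the correct
    candidate strictly above the perturbed one."""
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)

correct = "Mary entered the room. She sat down."
incorrect = perturb_pronouns(correct)  # cohesion now broken
```

A model that has learned coreference cohesion should assign the original passage a higher score than the perturbed one, so its contrastive accuracy approaches 1.0.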
The training data enables fine-grained pretraining that better models the discourse information required by the benchmark. To better understand the challenges posed by GuoFeng, we conduct experiments on a variety of state-of-the-art models, including Transformer and pretrained models. We find that the tasks display different levels of difficulty, resulting in different behaviors and performances across models. Furthermore, fine-grained pretraining on the document-level, discourse-rich GuoFeng data improves performance, particularly on cohesive translation and coherent generation. However, the best models still achieve fairly low absolute scores, highlighting the difficulty of modeling discourse. This work makes three main contributions:
• Challenging Tasks: We propose a diverse set of discourse-aware tasks to evaluate monolingual and cross-lingual models' ability to understand and generate natural language.
• Considerable Resources: We build and release a variety of discourse-aware resources, including benchmarking datasets, a diagnostic test suite, a large-scale pretraining corpus, and discourse-aware pretrained models.
• Comprehensive Comparisons: We systematically compare many advanced pretraining methods on the benchmark and identify current challenges in discourse modeling for future exploration.

2. DISCOURSE-AWARE TASKS

To comprehensively evaluate the target models, GuoFeng covers three types of NLP tasks: language understanding, translation, and generation. We design the benchmarking tasks using the following criteria: (1) the tasks should measure the ability of models to handle discourse phenomena, so we define discourse-related tasks at different levels of difficulty; (2) the datasets should contain rich discourse phenomena, so we build document-level datasets with whole contexts extracted from literary texts. To this end, we introduce nine discourse-aware tasks that are representative of challenging NLP tasks and easily applicable to real-world situations.



Figure 1: Discourse definition and example. The concept of discourse is detailed in Appendix §A.

An overview of our discourse-aware evaluation benchmark, covering language understanding, translation, and generation. All datasets consist of document-level texts in the literature domain, which are rich in discourse phenomena. Eight of them are newly created by us and one is expanded from an existing corpus (i.

