GuoFeng: A DISCOURSE-AWARE EVALUATION BENCHMARK FOR LANGUAGE UNDERSTANDING, TRANSLATION AND GENERATION

Abstract

Modeling discourse, the linguistic phenomena that go beyond individual sentences, is a fundamental and challenging problem in natural language processing (NLP). However, existing evaluation benchmarks mainly focus on intra-sentence properties and overlook important discourse phenomena that cross sentences. To bridge the gap, we propose GuoFeng, a benchmark that can evaluate inter-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. GuoFeng consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also propose a diagnostic test suite that can examine whether the target models learn discourse knowledge. We evaluate 17 general- and in-domain models based on Transformer and advanced pretraining architectures, showing that fine-grained pretraining on document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field.

1. INTRODUCTION

To evaluate the general performance of models, previous work proposed a variety of benchmarks covering different tasks and languages, such as GLUE (Wang et al., 2018), CLUE (Xu et al., 2020), and XGLUE (Liang et al., 2020). However, existing benchmarks pay little attention to discourse phenomena, the modeling of which is a fundamental and challenging problem in natural language processing (NLP) (Kevitt et al., 1992). Natural language generally consists of meaningful, unified, and purposive groups of sentences, which are organized as a whole according to discourse properties (Cook, 1989). As shown in Figure 1, the discourse property manifests in two ways: (1) cohesion, where the dependency between words or phrases makes them logically and consistently connected; (2) coherence, where the structural relation between segments or sentences enables them to be composed semantically and meaningfully. To bridge this gap, we introduce the GuoFeng benchmark for the targeted evaluation of discourse modeling. GuoFeng comprises three parts:
• GuoFeng Benchmark: It consists of nine Chinese/English discourse-aware tasks covering a broad range of NLP tasks (understanding, translation, and generation), data quantities (from 26.4K to 2.4M), and difficulties. Most of the benchmarking datasets are newly created in this work.
• GuoFeng Diagnostic Dataset: To understand the discourse information learned by models, GuoFeng also includes a dataset of 600 hand-crafted examples for probing trained models. Each instance in the dataset is a contrastive pair, where the correct candidate is the original instance in



Figure 1: Discourse definition and example. The concept of discourse is detailed in Appendix §A.

