DIP BENCHMARK TESTS: EVALUATION BENCHMARKS FOR DISCOURSE PHENOMENA IN MT

Abstract

Despite increasing instances of machine translation (MT) systems including extra-sentential context information, the evidence for translation quality improvement is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first MT benchmark testsets of their kind, which aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks, and evaluate several competitive baseline MT systems on the curated datasets. Surprisingly, we find that the complex context-aware models we test do not improve discourse-related translations consistently across languages and phenomena. Our evaluation benchmark is available as a leaderboard at <dipbenchmark1.github.io>.

1. INTRODUCTION AND RELATED WORK

The advances in neural machine translation (NMT) systems have led to great achievements in terms of state-of-the-art performance in automatic translation tasks. There have even been claims that their translations are no worse than what an average bilingual human may produce (Wu et al., 2016), or that the translations are on par with those of professional translators (Hassan et al., 2018). However, extensive studies conducting evaluations with professional translators (Läubli et al., 2018; Popel et al., 2020) have shown that there is a statistically strong preference for human translations in terms of fluency and overall quality when evaluations are conducted monolingually or at the document level. Document- (or discourse-) level phenomena (e.g., coreference, coherence) may not seem lexically significant, but contribute substantially to the readability and understandability of translated texts (Guillou, 2012). Targeted datasets for evaluating phenomena like coreference (Guillou et al., 2014; Guillou & Hardmeier, 2016; Lapshinova-Koltunski et al., 2018; Bawden et al., 2018; Voita et al., 2018b), or ellipsis and lexical cohesion (Voita et al., 2019), have been proposed. NMT frameworks such as the Transformer (Vaswani et al., 2017) provide more flexibility to incorporate larger context. This has spurred a great deal of interest in developing context-aware NMT systems that take advantage of source or target contexts, e.g., Miculicich et al. (2018), Maruf & Haffari (2018), Voita et al. (2018b; 2019), Xiong et al. (2019), Wong et al. (2020), to name a few. Most studies only report performance on specific testsets, often limited to improvements in BLEU (Papineni et al., 2002). Despite being the standard MT evaluation metric, BLEU has been criticised for its inadequacy: the scores are not interpretable, and are not sensitive to small lexical improvements that may lead to big improvements in fluency or readability (Reiter, 2018).
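This insensitivity is easy to demonstrate. Below is a minimal sketch using NLTK's sentence-level BLEU; the sentence pair and the gender-flipped pronoun are our own illustrative example, not drawn from the benchmark:

```python
from nltk.translate.bleu_score import sentence_bleu

# Reference translation with a correctly resolved anaphoric pronoun.
reference = "he closed the door because he was cold".split()

# Candidate translation that flips the pronoun gender -- a serious
# discourse (anaphora) error, but only a two-token difference.
candidate = "she closed the door because she was cold".split()

score = sentence_bleu([reference], candidate)
print(f"BLEU: {score:.2f}")  # stays high despite the anaphora error
```

Although the candidate misrepresents who performed the action, its n-gram overlap with the reference keeps the BLEU score far above zero, so the discourse error barely registers in the metric.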
There is no framework for a principled comparison of MT quality beyond mere lexical matching as done in BLEU: there are no standard corpora and no agreed-upon evaluation measures. To address these shortcomings, we propose the DiP benchmark tests (for Discourse Phenomena), which enable the comparison of machine translation models across discourse task strengths and source languages. We create diagnostic testsets for four diverse discourse phenomena, and also propose automatic evaluation methods for these tasks. However, discourse phenomena in translations can be tricky to identify, let alone evaluate. A fair number of datasets proposed thus far have been manually curated, and automatic evaluation methods have often failed to agree with human judgments (Guillou & Hardmeier, 2018). To mitigate these issues, we use trained neural models for identifying and evaluating complex discourse phenomena, and conduct extensive user studies to ensure agreement with human judgments. Our methods for automatically extracting testsets can be applied to multiple languages, and find cases that are difficult to translate without having to resort to synthetic data. Moreover, our testsets are extracted in a way that makes them representative of current challenges. They can be easily updated to reflect future challenges, preventing the pitfall of becoming outdated, which is a common failing of many benchmarking testsets. We also benchmark established MT models on these testsets to convey the extent of the challenges they pose. Although discourse phenomena can and do occur at the sentence level (e.g., between clauses), we would expect MT systems that model extra-sentential context (Voita et al., 2018b; Zhang et al., 2018; Miculicich et al., 2018) to be more successful on these tasks. However, we observe significant differences in system behavior and quality across languages and phenomena, emphasizing the need for more extensive evaluation as a standard procedure.
We propose to maintain a leaderboard that tracks and highlights advances in MT quality that go beyond BLEU improvement. Our main contributions in this paper are as follows:
• Benchmark testsets for four discourse phenomena: anaphora, coherence & readability, lexical consistency, and discourse connectives.
• Automatic evaluation methods and agreements with human judgments.
• Benchmark evaluation and analysis of four context-aware systems contrasted with baselines, for German/Russian/Chinese-English language pairs.

2. MACHINE TRANSLATION MODELS

Model Architectures. We first introduce the MT systems that we will be benchmarking on our testsets. We evaluate a selection of established models of varying complexity (from simple sentence-level to complex context-aware models), taking care to include both source- and target-side context-aware models. We briefly describe the model architectures here:

• S2S: A standard 6-layer base Transformer model (Vaswani et al., 2017) which translates sentences independently.
• CONCAT: A 6-layer base Transformer whose input is two sentences (the previous and the current sentence) merged, with a special character as a separator (Tiedemann & Scherrer, 2017).
• ANAPH: Voita et al. (2018b) incorporate source context by encoding it with a separate encoder, then fusing it in the last layer of a standard Transformer encoder using a gate. They claim that their model explicitly captures anaphora resolution.
• TGTCON: To model target-side context, we implement a version of ANAPH with an extra multi-head attention operation in the decoder, computed between representations of the target sentence and the target context. The architecture is described in detail in the Appendix (A.5).
• SAN: Zhang et al. (2018) use a source attention network: a separate Transformer encoder encodes source context, which is incorporated into the source encoder and target decoder using gates.
• HAN: Miculicich et al. (2018) introduce a hierarchical attention network (HAN) into the Transformer framework to dynamically attend to the context at two levels: word and sentence. They achieve the highest BLEU when hierarchical attention is applied separately to both the encoder and decoder.

Datasets and Training. The statistics for the datasets used to train the models are shown in Table 1. We tokenize the data using Jieba (https://github.com/fxsjy/jieba) for Zh and the Moses scripts (https://www.statmt.org/moses/) for the other languages, lowercase the text, and apply BPE encodings (https://github.com/rsennrich/subword-nmt/) from Sennrich et al. (2016). We learn the BPE encodings with the command learn-joint-bpe-and-vocab -s 40000. The scores reported are BLEU4, computed either through fairseq or NLTK (Wagner, 2010). Further details about dataset composition, training settings and hyperparameters can be found in the Appendix (A.7).
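The sigmoid gate that ANAPH and SAN use to fuse context into the encoder can be sketched in a few lines. This is a simplified single-vector version under our own assumptions (random weight shapes, a plain convex combination); the actual models apply gating to multi-head attention outputs inside the Transformer layers:

```python
import numpy as np

def gated_fusion(h_src, h_ctx, W_g, b_g):
    """Fuse a context vector into a source representation via a sigmoid gate:
    g = sigmoid([h_src; h_ctx] @ W_g + b_g), output = g * h_src + (1 - g) * h_ctx."""
    z = np.concatenate([h_src, h_ctx]) @ W_g + b_g
    g = 1.0 / (1.0 + np.exp(-z))          # elementwise gate in (0, 1)
    return g * h_src + (1.0 - g) * h_ctx  # convex combination per dimension

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(0)
d = 4
h_src, h_ctx = rng.normal(size=d), rng.normal(size=d)
W_g, b_g = rng.normal(size=(2 * d, d)), np.zeros(d)
fused = gated_fusion(h_src, h_ctx, W_g, b_g)
```

Because the gate is strictly between 0 and 1, each output dimension interpolates between the source and context representations, letting the model learn how much context to admit per feature.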

