DIP BENCHMARK TESTS: EVALUATION BENCHMARKS FOR DISCOURSE PHENOMENA IN MT

Abstract

Although machine translation (MT) systems increasingly incorporate extra-sentential context, evidence of translation quality improvement is sparse, particularly for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first MT benchmark testsets of their kind, which aim to track and highlight improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks and evaluate several competitive baseline MT systems on the curated datasets. Surprisingly, we find that the complex context-aware models we test do not improve discourse-related translations consistently across languages and phenomena. Our evaluation benchmark is available as a leaderboard at <dipbenchmark1.github.io>.

1. INTRODUCTION AND RELATED WORK

Advances in neural machine translation (NMT) have led to state-of-the-art performance on automatic translation tasks. There have even been claims that NMT translations are no worse than what an average bilingual human may produce (Wu et al., 2016), or that they are on par with professional translators (Hassan et al., 2018). However, extensive studies conducting evaluations with professional translators (Läubli et al., 2018; Popel et al., 2020) have shown a statistically strong preference for human translations in terms of fluency and overall quality when evaluations are conducted monolingually or at the document level. Document-level (or discourse) phenomena (e.g., coreference, coherence) may not seem lexically significant, but they contribute substantially to the readability and understandability of translated texts (Guillou, 2012). Targeted datasets for evaluating phenomena like coreference (Guillou et al., 2014; Guillou & Hardmeier, 2016; Lapshinova-Koltunski et al., 2018; Bawden et al., 2018; Voita et al., 2018b), or ellipsis and lexical cohesion (Voita et al., 2019), have been proposed. NMT frameworks such as the Transformer (Vaswani et al., 2017) provide more flexibility to incorporate larger context, which has spurred a great deal of interest in developing context-aware NMT systems that take advantage of source or target context, e.g., Miculicich et al. (2018), Maruf & Haffari (2018), Voita et al. (2018b; 2019), Xiong et al. (2019), Wong et al. (2020), to name a few.

Most studies only report performance on specific testsets, often limited to improvements in BLEU (Papineni et al., 2002). Despite being the standard MT evaluation metric, BLEU has been criticised for its inadequacy: the scores are not interpretable, and they are not sensitive to small lexical improvements that may lead to large gains in fluency or readability (Reiter, 2018). There is no framework for a principled comparison of MT quality beyond the mere lexical matching done in BLEU: there are no standard corpora and no agreed-upon evaluation measures. To address these shortcomings, we propose the DiP benchmark tests (for Discourse Phenomena), which enable the comparison of machine translation models across discourse task strengths and source languages. We create diagnostic testsets for four diverse discourse phenomena and propose automatic evaluation methods for these tasks. However, discourse phenomena in translations can be tricky to identify, let alone evaluate; a fair number of the datasets proposed thus far have been manually curated, and automatic evaluation methods have often failed to agree with human judgments.
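To make the BLEU-insensitivity point concrete, consider the following minimal sketch (not part of our benchmark; it assumes the sacrebleu Python package and uses invented example sentences). A single wrong anaphoric pronoun breaks the coreference chain for a reader, yet the BLEU score barely moves because only one token differs from the reference:

import sacrebleu  # pip install sacrebleu

# Invented reference and two candidate translations. The second candidate
# differs only in the anaphoric pronoun, breaking the coreference chain
# established by "The artist ... her painting".
refs = ["The artist finished her painting. She sold it the next day."]
good = ["The artist finished her painting. She sold it the next day."]
bad  = ["The artist finished her painting. He sold it the next day."]

print(sacrebleu.corpus_bleu(good, [refs]).score)  # 100.0
print(sacrebleu.corpus_bleu(bad, [refs]).score)   # remains high despite the broken coreference

A metric that averages n-gram overlap over a whole test corpus dilutes such errors even further, which is why targeted testsets and phenomenon-specific evaluation are needed.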

