CROWDSOURCED PHRASE-BASED TOKENIZATION FOR LOW-RESOURCED NEURAL MACHINE TRANSLATION: THE CASE OF FON LANGUAGE

Abstract

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the scarcity of available resources for them, a great deal of work goes into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately deal with the grammatical, diacritical, and tonal properties of some African languages. That, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using the Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy that creates a more representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.

1. INTRODUCTION

We would like to start by sharing with you this Fon sentence: « mɛtà mɛtà wɛ zìnwó hɛn wa aligbo mɛ ». How would you tokenize it? What happens if we apply the standard method of splitting the sentence into its word elements, either on the space delimiter or into subword units? We did exactly that, and discovered that a translation (to French) model trained on sentences split this way gave a literal translation of «chaque singe est entré dans la vie avec sa tête, son destin» (English: each monkey entered the stage of life with its head, its destiny) for the Fon sentence above. But we are not talking about a monkey here. The sentence is a metaphor, and so some of its words should be taken collectively as phrases. Using a phrase-based tokenizer, we instead obtain a segmentation in which those words are grouped into multi-word expressions.

[Figure: word-level versus phrase-based segmentation of « mɛtà mɛtà wɛ zìnwó hɛn wa aligbo mɛ »]
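To make the contrast concrete, here is a minimal sketch of phrase-based tokenization via greedy longest match against a phrase lexicon. The lexicon and the English example below are hypothetical illustrations, standing in for the crowdsourced Fon expressions this paper builds on; they are not the paper's actual data or implementation.

```python
def phrase_tokenize(sentence, phrases, max_len=4):
    """Greedy longest-match phrase tokenizer (illustrative sketch).

    `phrases` is a set of known multi-word expressions; any span of up
    to `max_len` words that matches an entry is emitted as one token,
    otherwise we fall back to single-word tokens.
    """
    words = sentence.split()
    tokens, i = [], 0
    while i < len(words):
        match = None
        # Try the longest candidate span first, down to 2 words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in phrases:
                match = candidate
                break
        if match:
            tokens.append(match)
            i += len(match.split())
        else:
            tokens.append(words[i])
            i += 1
    return tokens

# Hypothetical lexicon: the idiom is kept together as one token,
# so a translation model can learn its non-literal meaning.
lexicon = {"kick the bucket"}
print(phrase_tokenize("he will kick the bucket soon", lexicon))
# ['he', 'will', 'kick the bucket', 'soon']
```

In the word-split view, "kick", "the", and "bucket" would each be translated literally; the phrase-based view preserves the idiom as a single vocabulary unit, which is the intuition behind treating Fon metaphors as super-words.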

