CROWDSOURCED PHRASE-BASED TOKENIZATION FOR LOW-RESOURCED NEURAL MACHINE TRANSLATION: THE CASE OF FON LANGUAGE

Abstract

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the scarcity of available resources for them, a lot of work goes into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately deal with the grammatical, diacritical, and tonal properties of some African languages. That, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.

1. INTRODUCTION

We would like to start by sharing with you this Fon sentence: «m¢tà m¢tà w¢ zìnwó h¢n wa aligbo m¢». How would you tokenize this? What happens if we apply the standard method of splitting the sentence into its word elements (using either the space delimiter or subword units)? We did exactly that, and discovered that a translation (to French) model trained on sentences split this way gave the literal translation «chaque singe est entré dans la vie avec sa tête, son destin» (English: «each monkey entered the stage of life with its head, its destiny») for the above Fon sentence. But we are not talking about a monkey here. It is a metaphor, and so some of the words should be taken collectively as phrases. Using a phrase-based tokenizer, we obtained a grouping of the words into candidate phrases. A native speaker looking at these grouped phrases will quickly point out that some of them are wrong; the phrase-based model probably could not learn the phrases effectively due to the small amount of data it was trained on. This time, we got the translation «singe chaque vient au monde dans vie avec tête et destin» (English: «monkey each comes into world in life with head and fate»). However, this translation is still not correct. The expression actually means «Every human being is born with chances». Another interpretation is that we must be open to change, and constantly be learning to take advantage of each situation in life. One illustrative example, which we encourage the reader to try, is to go to Google Translate and translate «it costs an arm and a leg» into your native language (or a language you understand). For the 20 languages we tried, all the translation results were wrong: literal, and not fully conveying the true (some would say phrasal) expression or meaning.
The expression «it costs an arm and a leg» simply means «it is expensive». Now imagine a language whose sentence structure is largely made up of such expressions: that is Fon. Tokenization is generally viewed as a solved problem. Yet, in practice, we often encounter difficulties in using standard tokenizers for NMT tasks, as shown above with Fon. This may be because of special tokenization needs for particular domains (like medicine (He & Kayaalp, 2006; Cruz Díaz & Maña López, 2015)) or languages. Fon, one of the five classes of the Gbe language clusters (Aja, Ewe, Fon, Gen, and Phla-Phera, according to Capo (2010)), is spoken by approximately 1.7 million people located in southwestern Nigeria, Benin, Togo, and southeastern Ghana. There exist approximately 53 different dialects of Fon spoken throughout Benin. Fon has complex grammar and syntax, is very tonal, and diacritics are highly influential (Dossou & Emezue, 2020). Despite its 1.7 million speakers, Joshi et al. (2020) have categorized Fon as «left behind» or «understudied» in NLP. This poses a challenge when using standard tokenization methods. Given that most Fon sentences (and, by extension, sentences in many African languages) are like the example given above (or combinations of such expressions), there is a need to revisit the tokenization of such languages. In this paper, using Fon in our experiments, we examine standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization. Furthermore, we test our tokenization strategy on the Fon-French and French-Fon translation tasks. Our main contributions are the dataset, our analysis, and the proposal of WEB for extremely low-resourced African languages (ALRLs). The dataset, models, and code will be open-sourced on our GitHub page.
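To make the intuition behind phrase-level tokenization concrete, the following is a minimal, hypothetical sketch (not the authors' WEB method, which involves human crowdsourcing): given a lexicon of known multi-word expressions, a greedy longest-match pass keeps each expression together as a single token instead of splitting it on spaces. The lexicon entry below reuses the English idiom from the text purely for illustration.

```python
def phrase_tokenize(sentence, lexicon):
    """Greedy longest-match tokenization: at each position, prefer the
    longest multi-word expression found in the lexicon; otherwise fall
    back to a single word (a sketch, not the WEB algorithm itself)."""
    words = sentence.split()
    max_len = max((len(p) for p in lexicon), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try spans from longest to shortest; n == 1 always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            cand = tuple(words[i:i + n])
            if n == 1 or cand in lexicon:
                tokens.append(" ".join(cand))
                i += n
                break
    return tokens

# Hypothetical expression lexicon: each entry is a tuple of words.
lexicon = {("costs", "an", "arm", "and", "a", "leg")}
tokens = phrase_tokenize("it costs an arm and a leg", lexicon)
# → ["it", "costs an arm and a leg"]
```

Treating the idiom as one vocabulary item lets a downstream model map it to «it is expensive» as a unit, rather than composing a literal word-by-word translation.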

2. BACKGROUND AND MOTIVATION

Modern NMT models usually require large amounts of parallel data in order to effectively learn the representations of morphologically rich source and target languages. While proposed solutions, such as transfer learning from a high-resource language (HRL) to the low-resource language (LRL) (Gu et al., 2018; Renduchintala et al., 2018; Karakanta et al., 2018) and using monolingual data (Sennrich et al., 2016a; Zhang & Zong, 2016; Burlot & Yvon, 2018; Hoang et al., 2018), have proved effective, they are still not able to produce good translation results for most ALRLs. Standard tokenization methods, like Subword Units (SU) (Sennrich et al., 2015), inspired by byte-pair encoding (BPE) (Gage, 1994), have greatly improved current NMT systems. However, studies have shown that BPE does not always boost the performance of NMT systems for analytical languages (Abbott & Martinus, 2018). Ngo et al. (2019) show that when morphological differences exist between source and target languages, SU does not significantly improve results. Therefore, there is a great need to revisit NMT with a focus on low-resourced, morphologically complex languages like Fon. This may involve taking a look at how to adapt standard NMT strategies to these languages.
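For reference, the BPE procedure that SU builds on can be sketched in a few lines: starting from character-level symbols, it repeatedly merges the most frequent adjacent pair. This is a simplified illustration of Gage (1994) / Sennrich et al. (2015), not the exact implementation used in any toolkit; the end-of-word marker `</w>` and the toy corpus are conventions chosen here for illustration.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations from a list of words (a didactic
    sketch of byte-pair encoding, not a production tokenizer)."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count frequencies of all adjacent symbol pairs.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "low", "lower", "newest", "newest"], num_merges=3)
```

Because the merges are driven purely by co-occurrence frequency, nothing constrains them to respect the multi-word expressions discussed above, which is precisely the gap a phrase-aware vocabulary aims to close.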

