On the use of linguistic similarities to improve Neural Machine Translation for African Languages

Abstract

In recent years, there has been a resurgence in research on empirical methods for machine translation. Most of this research has focused on high-resource European languages. Although around 30% of all languages spoken worldwide are African, these languages have been heavily under-investigated, partly because of the lack of public parallel corpora online. Furthermore, despite their large number (more than 2,000) and the similarities between them, there is currently no publicly available study on how to use this multilingualism (and the associated similarities) to improve the performance of machine translation systems on African languages. To address these issues, we propose a new dataset for African languages (from a source that allows us to use and release it) that provides parallel data for vernaculars not present in commonly used datasets such as JW300. To exploit multilingualism, we first use a historical approach based on population migrations to identify similar vernaculars. We also propose a new metric to automatically evaluate similarities between languages. Unlike traditional methods, this metric does not require word-level parallelism, only paragraph-level parallelism. We then show that performing Masked Language Modelling and Translation Language Modelling in addition to multi-task learning on a cluster of similar languages leads to a strong boost in performance when translating individual pairs inside this cluster. In particular, we record an improvement of 29 BLEU on the Bafia-Ewondo pair using our approaches compared to previous methods that did not exploit multilingualism in any way. Finally, we release the dataset and code of this work to ensure reproducibility and accelerate research in this domain.

1. Introduction

Machine Translation (MT) of African languages is a challenging problem for multiple reasons. As pointed out by Martinus and Abbott (2019), the main ones are:
• Morphological complexity and diversity of African languages: Africa is home to around 2144 of the 7111 languages living today (30.15% of all living languages), often with different alphabets (Eberhard et al., 2020).
• Lack or absence of large parallel datasets for most language pairs.
• Discoverability: The existing resources for African languages are often hard to find.
• Reproducibility: Data and code of existing research are rarely shared, making it difficult for other researchers to reproduce the results properly.
• Lack of benchmarks: Due to the low discoverability and the lack of research in the field, there is no publicly available benchmark or leaderboard against which to compare new machine translation techniques.

However, despite the strong multilingualism of the continent, previous research has focused solely on translating individual language pairs without taking advantage of the potential similarities between languages. The main purpose of this work is to exploit this multilingualism, starting from a few languages spoken in Central and West Africa, to produce better machine translation systems.

Our contributions can be summarised as follows:
1. We provide new parallel corpora extracted from the Bible for several pairs of African vernaculars that were not available until now, as well as the code used to perform this extraction.
2. We present a method for aggregating languages based on their historical origins, their morphologies, and their geographical and cultural distributions, among other criteria. We also propose a new metric to evaluate similarity between languages: this metric, based on language models, does not require word-level parallelism (contrary to traditional methods such as Levenshtein (1965) and Swadesh (1952)) but only paragraph-level parallelism. It also accounts for words that are present in the list of Swadesh (1952) but have no translation in African vernaculars (like "snow").
3. Using the language clusters built from these similarities, we show that Translation Language Modelling (Lample and Conneau, 2019) and multi-task learning generally improve the performance on individual pairs inside these clusters.

Our code, data and benchmark are publicly available at https://github.com/Tikquuss/meta_XLM

The rest of the paper is organised as follows. In section 2, we discuss the motivation of this work, i.e. problems faced by African communities that could be solved with machine translation systems. Related work is described in section 3 and relevant background is outlined in section 4. In section 5, we describe our dataset and provide more details on the language aggregation methods. In section 6, we present the experimental protocol and the results of our methodology on machine translation of 3 African (Cameroonian) vernaculars. Lastly, we summarise our work and conclude in section 7.
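To make the contrast drawn in the second contribution concrete, the following is a minimal sketch of a traditional word-level similarity measure, assuming two word-aligned lists in the style of a Swadesh list and using the Levenshtein edit distance. It is not the metric proposed in this paper. The function names `levenshtein` and `word_list_similarity` and the toy word lists are hypothetical, and the toy entries are invented strings rather than real vocabulary.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_list_similarity(words_a: Sequence[Optional[str]],
                         words_b: Sequence[Optional[str]]) -> float:
    """Mean normalised similarity over word-aligned lists.

    Concepts with no translation in either language (None or "") are
    skipped, which is the situation described above for words like "snow".
    """
    scores = []
    for wa, wb in zip(words_a, words_b):
        if not wa or not wb:
            continue  # no translation available for this concept
        dist = levenshtein(wa, wb)
        scores.append(1.0 - dist / max(len(wa), len(wb)))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Invented placeholder entries, aligned by concept; None marks a missing word.
    lang_a = ["mama", "tano", None]
    lang_b = ["mana", "tano", "sno"]
    print(word_list_similarity(lang_a, lang_b))
```

The sketch shows why such measures need word-level parallelism: every score is computed on a pair of words known to express the same concept, and concepts without a translation contribute nothing.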

2. Motivation

Africa, because of its multilingual nature, faces many communication problems. In particular, most African countries have colonial languages (French, English, Portuguese, etc.) as official languages, sometimes alongside one or two African dialects. These official languages are the ones taught in schools and used in administrations and workplaces, which are often located in urban areas. In contrast, citizens of rural and remote areas mainly speak African dialects, especially the older and the less educated ones. This situation creates the following fairness and ethical concerns, amongst others:
• Populations in remote areas often have difficulty communicating during medical consultations, since they mainly speak local languages whereas doctors mostly speak French, English, etc. Similar situations arise when NGOs try to intervene in rural regions.
• This discrepancy between the languages spoken across different regions of individual African countries makes the spread of misinformation easier, especially in rural regions and during election periods.
• Young Africans, in particular those living in urban areas, struggle to learn their vernaculars. The lack of documentation and of schools teaching African languages, as well as the scarcity of translators and trainers, makes the situation even more complicated.

For all of the reasons above, it is essential to set up translation systems for these languages.



https://en.wikipedia.org/wiki/Vernacular

