GLOBETROTTER: UNSUPERVISED MULTILINGUAL TRANSLATION FROM VISUAL ALIGNMENT

Anonymous authors
Paper under double-blind review

Abstract

Machine translation in a multi-language scenario requires large-scale parallel corpora for every language pair. Unsupervised translation is challenging because there is no explicit connection between languages, and existing methods have to rely on topological properties of the language representations. We introduce a framework that leverages visual similarity to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model with a single stage. Experiments with fifty-two languages show that our method outperforms prior work on word-level and sentence-level unsupervised translation, evaluated by retrieval.

1. INTRODUCTION

Figure 1: 自転車 · साइकिल · bicycle. While each language represents a bicycle with a different word, the underlying visual representation remains consistent: a bicycle has a similar appearance in the UK, France, Japan, and India. We leverage this natural property to learn models of machine translation across multiple languages without paired training corpora.

Machine translation aims to learn a mapping between sentences of different languages while maintaining the underlying semantics. In the last few years, sequence-to-sequence models have emerged as remarkably powerful methods for this task, leading to widespread applications in robust language translation. However, sequence-to-sequence models also require large datasets of parallel corpora for learning, which are expensive to collect and often impractical for rare language pairs.

We propose to leverage the synchronization between language and vision in order to learn models for machine translation without parallel training corpora. Instead of learning a direct mapping between languages, we present a model that aligns them by first mapping through a visual representation. We show how vision creates a transitive closure across modalities, which we use to establish positive and negative pairs of sentences without supervision. Since the visual appearance of scenes and objects remains relatively stable across different spoken languages, vision acts as a "bridge" between them. Our approach integrates these transitive relations into multi-modal contrastive learning. In our experiments and visualizations, we show that the transitive relations through vision provide excellent self-supervision for learning neural machine translation. Although we train our approach without paired language data, it is able to translate between 52 different languages better than several baselines. While vision is necessary for our approach during learning, there is no dependence on vision during inference.
After learning the language representation, our approach can translate both individual words and full sentences using retrieval.

The contributions of this paper are three-fold. First, we propose a method that leverages cross-modal alignment between language and vision to train a multilingual translation system without any parallel corpora. Second, we show that our method outperforms previous work by a significant margin on both sentence and word translation, where we use retrieval to test translation. Finally, to evaluate and analyze our approach, we release a federated multi-modal dataset spanning 52 different languages. Overall, our work shows that grounding language in vision helps develop language processing tools that are robust across languages, even when ground-truth alignment across languages is unavailable. Code, data, and pre-trained models will be released.
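The transitive bridge described above can be sketched numerically: sentences from different languages are matched through their similarity to the same images, never to each other directly. Below is a minimal NumPy illustration; the toy embeddings, dimensionality, and noise model are ours for exposition, not the paper's trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy setup: 4 images, each captioned in two languages. In the real
# system these embeddings come from trained image and text encoders;
# here we fake them as slightly noisy copies of the image vectors.
d = 256
images = l2_normalize(rng.normal(size=(4, d)))
lang_a = l2_normalize(images + 0.02 * rng.normal(size=(4, d)))  # e.g. English captions
lang_b = l2_normalize(images + 0.02 * rng.normal(size=(4, d)))  # e.g. Japanese captions

# Cross-modal similarity of every sentence to every image.
sim_a = lang_a @ images.T   # shape (4, 4)
sim_b = lang_b @ images.T

# Transitive sentence-to-sentence alignment through vision: two
# sentences score highly if they match the same images, even though
# no sentence pair was ever observed during training.
transitive = sim_a @ sim_b.T

# Each lang_a sentence retrieves its lang_b counterpart (same image
# index) purely through the visual bridge.
pred = transitive.argmax(axis=1)
print(pred)
```

The product `sim_a @ sim_b.T` is exactly the transitive closure through the image set: it sums, over all images, the evidence that both sentences describe that image.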

2. RELATED WORK

Our unsupervised joint visual and multilingual model builds on recent progress in both the natural language processing and computer vision communities. We briefly summarize the prior work.

Unsupervised language translation has been studied as a word representation alignment problem in Lample et al. (2018b), where the distributions of word embeddings for two unpaired languages are aligned to minimize a statistical distance between them. Lample et al. (2018a); Artetxe et al. (2018); Lample et al. (2018c); Lample & Conneau (2019) build on top of this idea, and train an encoder-decoder structure to enforce cycle-consistency when translating from one language to another and back to the first one. This approach achieves strong unsupervised word translation results, but it does not scale beyond two languages, and it does not leverage visual information during learning.

Multi-language models are general language models that develop language-independent architectures that work equally well for any language (Gerz et al., 2018). Lample & Conneau (2019); Conneau et al. (2020); Artetxe & Schwenk (2019); Devlin et al. (2019); Liu et al. (2020); Phang et al. (2020) share the same token embeddings across different languages, showing that this improves language modeling both for general downstream single-language NLP tasks and for supervised language translation across multiple languages. Lample & Conneau (2019); Conneau et al. (2020); Artetxe & Schwenk (2019) use a shared Byte Pair Encoding (BPE), which we adopt in our work. We loosely follow the architecture of Conneau et al. (2020) in that we train a transformer-based (Vaswani et al., 2017) masked language model with BPE.

Vision as multi-modal bridge means using vision as an interlingua between all languages. Using a third language as a pivot to translate between pairs of languages without source-target paired corpora has been studied for the past few years (e.g. Firat et al., 2016; Johnson et al., 2017; Garcia et al., 2020). Harwath et al. (2018); Azuh et al. (2019) use vision for the same purpose, but they work directly on the speech signal instead of text. Chen et al. (2018) use images to help translate between languages in the text modality; their model involves both generation and reinforcement learning, which makes optimization difficult, and it does not generalize to more than two languages. Sigurdsson et al. (2020) also use vision as a pivot for unsupervised translation. However, our approach works for multiple languages at once (instead of just two) and also obtains an explicit cross-lingual alignment; we share a single word embedding and language model for all languages, and use different training strategies. Our experiments quantitatively compare the two approaches, showing that ours performs better in both word and sentence translation. Other work views the input image as extra information for translation (e.g. Calixto & Liu, 2017; Su et al., 2019); there, paired data between languages is used rather than images as a bridge, and we refer readers to Specia et al. (2016) for an extensive overview of this topic. There has also been research on training multilingual language representations for downstream vision tasks, generally leveraging visual-language correspondence but without translation as a goal; unlike this paper, these methods make use of ground-truth language pairs (Wehrmann et al., 2019; Gella et al., 2017; Kim et al., 2020; Burns et al., 2020).

Translation by retrieval. We evaluate the representations using retrieval-based machine translation (Baldwin & Tanaka, 2000; Liu et al., 2012), which is often used in the context of example-based machine translation (e.g. Brown, 1996; 1997; 2001; Cranias et al., 1994; El-Shishtawy & El-Sammak, 2014), analogy-based translation (e.g. Nagao, 1984; Kimura et al., 2014), or translation memories (e.g. Chatzitheodorou, 2015; Dong et al., 2014; Wäschle & Riezler, 2015; Baldwin, 2001). While there are also generation-based translation approaches, they are difficult to evaluate automatically: there is no well-defined metric for what constitutes a good generated translation (Callison-Burch et al., 2006). Instead, we evaluate our approach using translation-by-retrieval, which allows rigorous experimental validation of the cross-lingual alignment in the representation. State-of-the-art cross-lingual retrieval approaches rely on supervised language pairs, and range from training models in a standard contrastive learning setting (Chi et al., 2020) to more complex combinations of the language pairs, such as using cross-attention (Anonymous, 2021) or introducing custom fusion layers (Fang et al., 2020). Our approach does not require supervised language pairs.
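Translation-by-retrieval reduces to nearest-neighbor search in a shared embedding space. The sketch below uses a hypothetical 3-dimensional space and hand-made toy embeddings to show the mechanics; the real system would use the trained multilingual encoder.

```python
import numpy as np

def retrieve_translation(query_emb, target_embs, target_sentences):
    """Translate by retrieval: return the target-language sentence whose
    embedding has the highest cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    scores = t @ q
    return target_sentences[int(scores.argmax())], scores

# Toy shared embedding space (3-d for readability).
english = ["a red bicycle", "a sleeping cat", "two cups of coffee"]
english_embs = np.array([[1.0, 0.1, 0.0],
                         [0.0, 1.0, 0.1],
                         [0.1, 0.0, 1.0]])

# Hypothetical embedding of the French query "un velo rouge"; because the
# space is shared across languages, it lies near its English counterpart.
query = np.array([0.9, 0.2, 0.05])

best, scores = retrieve_translation(query, english_embs, english)
print(best)
```

Note that no decoder is involved: evaluation only checks whether the correct target sentence is ranked first, which is what makes the metric well-defined.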

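The shared Byte Pair Encoding used above can be illustrated with a textbook merge-learning procedure in the style of Sennrich et al.; this is a didactic sketch, not the paper's actual tokenizer. Learning merges over a corpus that mixes languages lets related surface forms across languages share subword units.

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent
    adjacent symbol pair across the whole (multilingual) corpus."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

# A single vocabulary trained over all languages at once: cognates such
# as English "nation(al)" and Spanish "nacion(al)" share subword units.
corpus = ["nation", "national", "nation", "nacion", "nacional"]
merges = bpe_merges(corpus, 6)
print(merges)
```

The first learned merge is ("n", "a"), the most frequent pair in both language families of this toy corpus, illustrating how shared BPE units emerge without any cross-lingual supervision.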
