GLOBETROTTER: UNSUPERVISED MULTILINGUAL TRANSLATION FROM VISUAL ALIGNMENT

Anonymous authors
Paper under double-blind review

Abstract

Machine translation in a multi-language scenario requires large-scale parallel corpora for every language pair. Unsupervised translation is challenging because there is no explicit connection between languages, and existing methods must rely on topological properties of the language representations. We introduce a framework that leverages visual similarity to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model in a single stage. Experiments with fifty-two languages show that our method outperforms prior work on unsupervised word-level and sentence-level translation using retrieval.



Introduction

Machine translation aims to learn a mapping between sentences of different languages while also preserving the underlying semantics. In the last few years, sequence-to-sequence models have emerged as remarkably powerful methods for this task, leading to widespread applications in robust language translation. However, sequence-to-sequence models also require large datasets of parallel corpora for learning, which are expensive to collect and often impractical for rare language pairs.

We propose to leverage the synchronization between language and vision in order to learn models for machine translation without parallel training corpora. Instead of learning a direct mapping between languages, we present a model that aligns them by first mapping through a visual representation. We show how vision creates a transitive closure across modalities, which we use to establish positive and negative pairs of sentences without supervision. Since the visual appearance of scenes and objects remains relatively stable across different spoken languages, vision acts as a "bridge" between them. Our approach integrates these transitive relations into multi-modal contrastive learning.

In our experiments and visualizations, we show that the transitive relations through vision provide excellent self-supervision for learning neural machine translation. Although we train our approach without paired language data, it is able to translate between 52 different languages better than several baselines. While vision is necessary for our approach during learning, there is no dependence on vision during inference. After learning the language representation, our approach can translate both individual words and full sentences using retrieval.

The contributions of this paper are three-fold. First, we propose a method that leverages cross-modal alignment between language and vision to train a multilingual translation system without any parallel corpora.
Second, we show that our method outperforms previous work by a significant margin on both sentence and word translation, where we use retrieval to test translation. Finally, to evaluate and analyze our approach, we release a federated multi-modal dataset spanning 52 different languages.
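The transitive mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; the toy embeddings, dimensions, and the `translate` helper are all hypothetical, and real systems would learn the encoders with a contrastive loss rather than use synthetic vectors. The sketch only shows the pipeline shape: sentences in two languages are compared to a shared pool of images, and sentences that agree with the same images are treated as positive cross-lingual pairs; translation then reduces to nearest-neighbor retrieval in the shared space.

```python
import numpy as np

def l2n(x):
    # L2-normalize rows so that dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 64  # hypothetical embedding dimension

# Toy embeddings for 4 images; sentences in two languages are modeled as
# noisy copies of the image embedding they describe (a stand-in for
# learned encoders -- no parallel text is used anywhere).
img = l2n(rng.normal(size=(4, d)))
sent_a = l2n(img + 0.05 * rng.normal(size=(4, d)))  # language A
sent_b = l2n(img + 0.05 * rng.normal(size=(4, d)))  # language B

# Step 1: cross-modal similarity of every sentence to every image.
sim_a = sent_a @ img.T  # (4 sentences_A, 4 images)
sim_b = sent_b @ img.T

# Step 2: transitive closure through vision -- sentences that match the
# same images become positive pairs for contrastive learning.
trans = sim_a @ sim_b.T
positives = trans.argmax(axis=1)  # pseudo-pairs across languages

# Step 3: inference needs no images -- translate a language-A sentence by
# retrieving the nearest language-B sentence in the shared space.
def translate(i, sentences_b):
    return sentences_b[int((sent_b @ sent_a[i]).argmax())]
```

Because the noisy sentence embeddings stay closest to their own image, `positives` recovers the correct cross-lingual pairing here; in the actual method this signal supervises the language encoders rather than being read off directly.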



Figure 1: While each language represents a bicycle with a different word, the underlying visual representation remains consistent. A bicycle has a similar appearance in the UK, France, Japan, and India. We leverage this natural property to learn models of machine translation across multiple languages without paired training corpora.

