NEAREST NEIGHBOR MACHINE TRANSLATION

Abstract

We introduce k-nearest-neighbor machine translation (kNN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. kNN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results, without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, kNN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.
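
As a rough illustration of the idea, the following minimal sketch shows one way a nearest neighbor token classifier over a datastore could be combined with a base model's next-token distribution. It is not the implementation used in this work: the function names, the brute-force search, the squared L2 distance, the softmax temperature, and the interpolation weight `lam` are all assumptions chosen for illustration, and the toy datastore is made up.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over target tokens.

    keys:   (N, d) array of cached context representations (hypothetical datastore).
    values: (N,) integer array with the target token that followed each context.
    """
    # Squared L2 distance from the query to every key (an assumed distance function).
    dists = np.sum((keys - query) ** 2, axis=1)
    # Brute-force search for illustration only; a datastore with billions of
    # entries would need an approximate nearest neighbor index instead.
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances, so closer neighbors receive more weight.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Sum the weights of neighbors that share the same target token.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    return p_knn


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the base model distribution with the kNN distribution (lam is assumed)."""
    return lam * p_knn + (1.0 - lam) * p_model


# Toy datastore: two contexts near the origin were followed by token 2.
keys = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
values = np.array([2, 2, 1, 3])
p_model = np.array([0.1, 0.6, 0.2, 0.1])  # base model prefers token 1
query = np.array([0.05, 0.0])             # representation of the current context

p_knn = knn_distribution(query, keys, values, vocab_size=4, k=2)
p_final = interpolate(p_model, p_knn, lam=0.5)
```

In this toy case both retrieved neighbors vote for token 2, so the combined distribution moves most of its mass to that token even though the base model alone prefers token 1. Swapping in a different `keys`/`values` pair changes the prediction without touching the base model, which is the sense in which a domain-specific datastore can adapt a single model.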

1. INTRODUCTION

Non-parametric methods have recently been successfully applied to tasks such as language modeling (Khandelwal et al., 2020) and question answering (Guu et al., 2020; Lewis et al., 2020). They allow models that are (1) expressive, because they can use an arbitrary amount of data at test time; (2) adaptable, because predictions can be controlled by changing the datastore; and (3) interpretable, because the data used to make the prediction can be directly inspected. We introduce kNN-MT, a simple non-parametric method for machine translation (MT) using nearest neighbor retrieval. kNN-MT can be added to any pre-trained neural translation model without further training, and significantly improves performance for in-domain, out-of-domain, and multilingual evaluations.

More specifically, kNN-MT interpolates the target-token softmax distribution from a neural MT model with a multinomial generated using nearest neighbor search over examples cached in a datastore. The cache is over translation contexts (i.e., the complete source and prefix of the target), and is indexed by hidden states computed from the base MT model. We hypothesize that contexts that are close in representation space are more likely to be followed by the same target word. We show this is true not only for the original training data, thereby improving base model performance, but also across a range of different bi-text corpora, allowing for simple and effective model adaptation.

Our work builds upon recent results showing the effectiveness of nearest neighbor methods in unconditional language models (Khandelwal et al., 2020). We generalize to conditional language models by using both source and target context, and show that nearest neighbor models can be effective for generation in addition to density estimation. Compared to prior work on non-parametric methods for MT, our approach is arguably simpler (in that it requires no training, as compared to Gu et al. (2018)) and more expressive (in that it provides access to billions of key-value pairs during inference, as compared to Zhang et al. (2018); Gu et al. (2018)).

Extensive experiments show that kNN-MT scales to datastores containing billions of tokens, improving results across a range of settings. For example, it improves a state-of-the-art German-English translation model by 1.5 BLEU. kNN-MT can also be used to adapt a single model to

