SIMPLE AND SCALABLE NEAREST NEIGHBOR MACHINE TRANSLATION

Abstract

kNN-MT (Khandelwal et al., 2021) is a straightforward yet powerful approach to fast domain adaptation: it equips pre-trained neural machine translation (NMT) models with domain-specific token-level k-nearest-neighbor (kNN) retrieval, achieving domain adaptation without retraining. Despite being conceptually attractive, kNN-MT suffers from massive storage requirements and high computational complexity, since it conducts nearest-neighbor searches over the entire reference corpus. In this paper, we propose a simple and scalable nearest neighbor machine translation framework that drastically improves the decoding and storage efficiency of kNN-based models while maintaining translation performance. To this end, we dynamically construct an extremely small datastore for each input via sentence-level retrieval, avoiding the search over the entire datastore required by vanilla kNN-MT, and we further introduce a distance-aware adapter to adaptively incorporate the kNN retrieval results into the pre-trained NMT model. Experiments on machine translation in two general settings, static domain adaptation and online learning, demonstrate that our proposed approach not only achieves nearly 90% of the decoding speed of the base NMT model without performance degradation, but also significantly reduces the storage requirements of kNN-MT.

1. INTRODUCTION

Domain adaptation is a fundamental challenge in machine learning that aims to cope with the discrepancy across domain distributions and improve the generality of trained models. It has attracted wide attention in the neural machine translation (NMT) area (Britz et al., 2017; Chen et al., 2017; Chu & Wang, 2018; Bapna & Firat, 2019; Bapna et al., 2019; Wei et al., 2020). Recently, kNN-MT and its variants (Khandelwal et al., 2021; Zheng et al., 2021a;b; Wang et al., 2022a) have provided a new paradigm and achieved remarkable performance for fast domain adaptation via retrieval pipelines. These approaches combine traditional NMT models (Bahdanau et al., 2015; Vaswani et al., 2017) with a token-level k-nearest-neighbor (kNN) retrieval mechanism, allowing them to directly access a domain-specific datastore and improve translation accuracy without fine-tuning the entire model. By virtue of this promising ability, a single kNN-MT model can be seamlessly generalized to other domains by simply altering the external knowledge it attends to.

In spite of these achievements and potential benefits, the critical bottleneck of kNN-MT is its large token-level external knowledge store (also called a datastore), which brings massive storage overhead and high latency during inference. For instance, Khandelwal et al. (2021) found that kNN-MT is two orders of magnitude slower than the base NMT system in generation speed when retrieving 64 keys from a datastore containing billions of records. To ease this drawback and make kNN search more efficient, one line of work (Martins et al., 2022b; Wang et al., 2022a) reduces the volume of the datastore, for example by pruning redundant records and reducing the dimension of keys. Along another line, Meng et al. (2022) designed Fast kNN-MT, which constructs a smaller datastore for each source sentence instead of consulting the entire datastore.
Typically, this small datastore is constructed by searching for the nearest token-level neighbors of the source tokens and mapping them to the corresponding target tokens. In essence, however, Fast kNN-MT merely migrates the inefficient kNN retrieval from the target side to the source side, so its decoding remains two times slower than that of the standard NMT model. Despite this failure to speed up decoding, the insight behind Fast kNN-MT is that leveraging the entire datastore for nearest neighbor search is avoidable. Beyond decoding acceleration, little attention has been paid to the storage overhead of kNN-MT, which also limits its practicality for real-world online services. Concretely, kNN-MT requires nearly 37GB of disk storage for a datastore created from 467k law-domain sentence pairs, while the plain text takes less than 100MB, implying prohibitive storage requirements for larger corpora. Both factors lead us to reconsider the necessity of constructing a complete datastore for kNN-MT. To this end, for each test input, we investigate the sentence pairs involved in the decoding process of kNN-MT; the results are shown in Table 1. We collect the minimum set of sentence pairs that covers all decoding steps of kNN-MT on multi-domain German-to-English datasets, including IT, Medical and Law.

Table 1: Analysis of samples involved in kNN retrieval (k=8) when translating multi-domain German-to-English datasets (IT, Medical and Law). The left column reports the statistics of the full data and the corresponding kNN-MT performance; the right column reports the average number of sentence pairs (Sents) and target tokens (Tokens), the storage overhead of the datastore (Datastore), and the corresponding performance of our proposed method (SK-MT) on these samples.
Interestingly, we discover that over the whole decoding process, only a small number of training samples (less than 16 on average) from the domain-specific reference corpus are required for each test sentence, while the storage overhead of a datastore built from these samples is negligible. Moreover, kNN-MT combined with this dynamic datastore achieves performance superior to that of vanilla kNN-MT. This phenomenon offers a new perspective: it is feasible to perform sentence-level retrieval to obtain a small number of similar samples for kNN-MT generation. Following this line, we propose a simple and scalable nearest neighbor machine translation framework (SK-MT) that addresses the limitations of kNN-MT while preserving its performance. The approach first leverages an efficient text retrieval mechanism, such as BM25 (Robertson & Zaragoza, 2009), to obtain a small number of reference samples that highly overlap with the input sentence, and then dynamically constructs a tiny datastore by forwarding these samples through the pre-trained model. In this way, we avoid searching the whole datastore for nearest neighbors and drastically improve the decoding and storage efficiency of kNN-MT. On top of that, a distance-aware adapter is introduced to adaptively incorporate the kNN retrieval results into the pre-trained NMT model. We conduct experiments in two settings, static domain adaptation and online learning. Experimental results demonstrate that our approach not only achieves nearly 90% of the decoding speed of the base NMT model without performance degradation, but also significantly reduces the storage requirements of kNN-MT. Our code is open-sourced at https://github.com/dirkiedai/sk-mt.
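The sentence-level retrieval step above can be sketched as follows. This is a minimal, self-contained illustration with a simplified Okapi BM25 scorer over whitespace-tokenized text; the toy corpus, the parameters k1 and b, and the cutoff m are illustrative assumptions, not details from the paper, which may use any standard BM25 implementation.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each corpus sentence against the query with a simplified Okapi BM25."""
    docs = [doc.split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Retrieve the top-m reference sentences for one test input; the tiny
# datastore is then built from these m pairs instead of the full corpus.
corpus_src = ["klicken sie auf die schaltfläche ok",
              "das gericht entschied zugunsten des klägers",
              "klicken sie auf abbrechen um den vorgang zu beenden"]
query = "klicken sie auf ok"
m = 2
scores = bm25_scores(query, corpus_src)
top = sorted(range(len(corpus_src)), key=lambda i: scores[i], reverse=True)[:m]
print(top)  # → [0, 2]: the two sentences overlapping the query
```

The retrieved indices select source-target pairs from the reference corpus; only those pairs are forwarded through the NMT model to build the per-input datastore.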

2. BACKGROUND: kNN-MT

Recently, Khandelwal et al. (2021) proposed kNN-MT, which augments pre-trained NMT models with a translation memory retriever, empowering models to attend to external knowledge and improve translation performance. It is generally formulated as two processes: datastore construction and inference with kNN retrieval.

Datastore Construction. The datastore is a translation memory that converts bilingual sentence pairs into a set of key-value pairs. Given a reference corpus, for each sentence pair $(x, y) \in (\mathcal{X}, \mathcal{Y})$ the pre-trained NMT model generates the context representation $f_\theta(x, y_{<t})$ at each timestep $t$. We then collect the output hidden state $f_\theta(x, y_{<t})$ as key and $y_t$ as value to construct the whole datastore $(\mathcal{K}, \mathcal{V})$:

$$(\mathcal{K}, \mathcal{V}) = \bigcup_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \left\{ \left( f_\theta(x, y_{<t}),\, y_t \right),\ \forall y_t \in y \right\}. \quad (1)$$
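A minimal NumPy sketch of Eq. (1) and of the standard kNN inference step from Khandelwal et al. (2021) is given below. Random vectors stand in for real decoder hidden states, and the temperature T, the interpolation weight lam, and the uniform placeholder NMT distribution are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_datastore(hidden_states, targets):
    """Eq. (1): stack decoder hidden states as keys and next tokens as values."""
    keys = np.stack(hidden_states)   # shape: (num_tokens, d)
    values = np.array(targets)       # shape: (num_tokens,)
    return keys, values

def knn_probs(query, keys, values, vocab_size, k=8, T=10.0):
    """p_kNN(y) ∝ sum over the k nearest neighbors of exp(-d_i / T) * 1[v_i = y]."""
    d = np.sum((keys - query) ** 2, axis=1)   # squared L2 distances to all keys
    idx = np.argsort(d)[:k]                   # indices of the k nearest neighbors
    w = np.exp(-d[idx] / T)                   # distance-based neighbor weights
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[idx], w)          # aggregate weights by value token
    return probs / probs.sum()

# Toy run: 100 key-value pairs with 16-dim keys over a 50-token vocabulary.
rng = np.random.default_rng(0)
keys, values = build_datastore([rng.normal(size=16) for _ in range(100)],
                               rng.integers(0, 50, size=100))
query = rng.normal(size=16)                   # current decoder hidden state
p_knn = knn_probs(query, keys, values, vocab_size=50)

# Inference interpolates the kNN and NMT distributions (Khandelwal et al., 2021):
# p(y) = lam * p_kNN(y) + (1 - lam) * p_NMT(y).
lam = 0.5
p_nmt = np.full(50, 1 / 50)                   # placeholder NMT distribution
p = lam * p_knn + (1 - lam) * p_nmt
```

Vanilla kNN-MT runs the nearest-neighbor search over keys for the entire reference corpus at every decoding step; in SK-MT, the same computation is applied to the tiny per-input datastore built from the retrieved sentence pairs.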

