SIMPLE AND SCALABLE NEAREST NEIGHBOR MACHINE TRANSLATION

Abstract

kNN-MT (Khandelwal et al., 2021) is a straightforward yet powerful approach for fast domain adaptation, which directly augments pre-trained neural machine translation (NMT) models with domain-specific token-level k-nearest-neighbor (kNN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, kNN-MT suffers from massive storage requirements and high computational complexity, since it conducts nearest neighbor searches over the entire reference corpus. In this paper, we propose a simple and scalable nearest neighbor machine translation framework that drastically improves the decoding and storage efficiency of kNN-based models while maintaining translation performance. To this end, we dynamically construct an extremely small datastore for each input via sentence-level retrieval, avoiding the search over the entire datastore required by vanilla kNN-MT, and we further introduce a distance-aware adapter to adaptively incorporate the kNN retrieval results into the pre-trained NMT model. Experiments on machine translation in two general settings, static domain adaptation and online learning, demonstrate that our proposed approach not only retains nearly 90% of the decoding speed of the base NMT model without performance degradation, but also significantly reduces the storage requirements of kNN-MT.

1. INTRODUCTION

Domain adaptation is one of the fundamental challenges in machine learning, which aspires to cope with the discrepancy across domain distributions and improve the generality of trained models. It has attracted wide attention in the neural machine translation (NMT) area (Britz et al., 2017; Chen et al., 2017; Chu & Wang, 2018; Bapna & Firat, 2019; Bapna et al., 2019; Wei et al., 2020). Recently, kNN-MT and its variants (Khandelwal et al., 2021; Zheng et al., 2021a;b; Wang et al., 2022a) provide a new paradigm and have achieved remarkable performance for fast domain adaptation via retrieval pipelines. These approaches combine traditional NMT models (Bahdanau et al., 2015; Vaswani et al., 2017) with a token-level k-nearest-neighbor (kNN) retrieval mechanism, allowing models to directly access a domain-specific datastore to improve translation accuracy without fine-tuning the entire model. By virtue of this promising ability, a single kNN-MT model can be seamlessly generalized to other domains by simply altering the external knowledge it attends to. In spite of these achievements and potential benefits, the critical bottleneck of kNN-MT is its large token-level external knowledge (also called a datastore), which brings massive storage overhead and high latency during inference. For instance, Khandelwal et al. (2021) found that kNN-MT is two orders of magnitude slower than the base NMT system in generation speed when retrieving 64 keys from a datastore containing billions of records. To ease this drawback and make kNN search more efficient, one line of work (Martins et al., 2022b; Wang et al., 2022a) proposed methods to reduce the volume of the datastore, such as pruning redundant records and reducing the dimension of keys. Along another line, Meng et al. (2022) designed Fast kNN-MT to construct a smaller datastore for each source sentence instead of consulting the entire datastore.
Typically, the small datastore is constructed by searching for the nearest token-level neighbors of the source tokens and mapping them to the corresponding target tokens. However, in essence, Fast kNN-MT merely migrates the inefficient kNN retrieval from the target side to the source side, resulting in a decoding speed that is still two times slower than the standard NMT model. Despite its failure to speed up decoding, the insight behind Fast kNN-MT inspires us that leveraging the entire datastore for nearest neighbor search is avoidable. In addition to decoding acceleration, little attention has been paid to the storage overhead of kNN-MT, which also limits its practicability and promotion in real-world online services. Concretely, kNN-MT requires nearly 37GB of disk storage to create a datastore from 467k law-domain sentence pairs, while the plain text takes less than 100MB, implying prohibitive storage requirements when applied to a larger corpus. Both factors lead us to reflect on the necessity of constructing a complete datastore for kNN-MT. To this end, for each test input, we investigate the sentence pairs involved during the decoding process of kNN-MT, and the results are shown in Table 1. We collect the minimum set of sentence pairs that covers all decoding steps of kNN-MT on multi-domain German-to-English datasets, including IT, Medical and Law.

Table 1: Analysis of samples involved in kNN retrieval (k=8) when translating multi-domain German-to-English datasets (IT, Medical and Law). The left column includes the statistics of the full data and the corresponding kNN-MT performance, while the right column shows the average number of sentence pairs (Sents) and target tokens (Tokens), the storage overhead of the datastore (Datastore), and the corresponding performance of our proposed method (SK-MT) on these samples.
Interestingly, we discover that during the whole decoding process, only a small number of training samples (i.e., fewer than 16 on average) from the domain-specific reference corpus are required for each test sentence, while the storage overhead of a datastore built from these samples is negligible. Moreover, combining kNN-MT with such a dynamic datastore achieves superior performance to vanilla kNN-MT. This phenomenon provides a new perspective: it is feasible to perform sentence-level retrieval to obtain a small number of similar samples for kNN-MT generation. Following this line, we propose a simple and scalable nearest neighbor machine translation framework (SK-MT) that addresses the limitations of kNN-MT while keeping its performance. This approach first leverages an efficient text retrieval mechanism, such as BM25 (Robertson & Zaragoza, 2009), to obtain a small number of reference samples that have high overlap with the input sentence, and then dynamically constructs a tiny datastore by forwarding the samples through the pre-trained model. In this way, we avoid searching the whole datastore for nearest neighbors and drastically improve the decoding and storage efficiency of kNN-MT. On top of that, a distance-aware adapter is introduced to adaptively incorporate the kNN retrieval results into the pre-trained NMT model. We conduct experiments in two settings, static domain adaptation and online learning. Experimental results demonstrate that our proposed approach not only retains nearly 90% of the decoding speed of the NMT model without performance degradation, but also significantly reduces the storage requirements of kNN-MT. Our code is open-sourced on https://github.com/dirkiedai/sk-mt.

2. BACKGROUND: kNN-MT

Recently, Khandelwal et al. (2021) proposed kNN-MT, which augments pre-trained NMT models with a translation memory retriever, empowering models to attend to external knowledge and improve translation performance. It is generally formulated as two processes: datastore construction and inference with kNN retrieval.

Datastore Construction. The datastore is a translation memory that converts bilingual sentence pairs into a set of key-value pairs. Given a reference corpus (x, y) ∈ (X, Y), the pre-trained NMT model generates the context representation f_θ(x, y_{<t}) at each timestep t. We collect the output hidden state f_θ(x, y_{<t}) as key and y_t as value to construct the whole datastore (K, V):

(K, V) = ⋃_{(x,y)∈(X,Y)} {(f_θ(x, y_{<t}), y_t), ∀ y_t ∈ y}.   (1)

Inference with kNN Retrieval. At each decoding step t, the current context representation f_θ(x, ŷ_{<t}) is used as a query to retrieve the k nearest neighbors N = {(h_i, v_i)}_{i=1}^{k} from the datastore, which are converted into a distribution over the vocabulary:

p_kNN(y_t | x, ŷ_{<t}) ∝ Σ_{(h_i, v_i)∈N} 1_{y_t = v_i} exp(−d(h_i, f_θ(x, ŷ_{<t})) / τ),   (2)

where d(·, ·) stands for the Euclidean distance function and τ is the temperature that controls the sharpness of the softmax function. The final prediction distribution enhances the vanilla NMT distribution p_NMT with the retrieval distribution p_kNN, and is formally calculated as:

p(y_t | x, ŷ_{<t}) = λ p_kNN(y_t | x, ŷ_{<t}) + (1 − λ) p_NMT(y_t | x, ŷ_{<t}),   (3)

where λ is a tuned interpolation coefficient. In consideration of beam search in the inference phase, the kNN search over the entire datastore (K, V) is applied at each decoding step of every beam, resulting in a total time complexity of O(|K|Bl), where B denotes the beam size and l the target length. Moreover, the storage requirement of the datastore leads to a space complexity of O(|K|h), where h stands for the dimension of the hidden representations. The decisive factor in both time and space complexity is the datastore size |K|; therefore, the growth of |K| engenders massive storage overhead and high latency. In practice, kNN-MT usually adopts the FAISS (Johnson et al., 2021) toolkit for efficient approximate similarity search and clustering of dense vectors.
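As a concrete illustration, one kNN-MT decoding step can be sketched with NumPy. This is a minimal sketch over a flat in-memory datastore; the function name, shapes, and exhaustive distance computation are our illustrative assumptions, not the paper's FAISS-based implementation:

```python
import numpy as np

def knn_mt_step(query, keys, values, p_nmt, k=8, tau=100.0, lam=0.5):
    """One kNN-MT decoding step (illustrative sketch).

    query:  (h,)   decoder context representation f_theta(x, y_<t)
    keys:   (n, h) datastore keys
    values: (n,)   datastore values (target-token ids)
    p_nmt:  (V,)   NMT output distribution over the vocabulary
    """
    # Euclidean distance from the query to every key; keep the k nearest.
    d = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(d)[:k]

    # Retrieval distribution: softmax over negative distances, scattered
    # onto the vocabulary entries of the retrieved values (Equation 2).
    w = np.exp(-d[nn] / tau)
    p_knn = np.zeros_like(p_nmt)
    np.add.at(p_knn, values[nn], w)   # neighbors sharing a token accumulate
    p_knn /= p_knn.sum()

    # Interpolate the retrieval and NMT distributions with coefficient lambda.
    return lam * p_knn + (1.0 - lam) * p_nmt
```

Running such a step for every beam at every timestep is exactly what yields the O(|K|Bl) time complexity; FAISS replaces the exhaustive distance computation with approximate search.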
However, given an extensive datastore (e.g., billions of records), decoding and storage efficiency remain unsatisfactory (Khandelwal et al., 2021).

3. METHODOLOGY

In this work, we provide a new perspective that leverages sentence-level retrieval to dynamically build an extremely small datastore for each test sentence, instead of constructing a complete datastore as in kNN-MT. It is inspired by the interesting phenomenon that only a few training samples from the domain-specific reference corpus are involved in the decoding process of kNN-MT for each test input, as shown in Table 1. We design a simple and scalable nearest neighbor machine translation framework (SK-MT) to address the shortcomings of kNN-MT while keeping its performance in fast domain adaptation. As illustrated in Figure 1, the overall framework can be decomposed into two main processes, i.e., dynamic datastore construction and inference with adaptive kNN retrieval. Next, we give a detailed view of these two modules.

3.1. DYNAMIC DATASTORE CONSTRUCTION

In our framework, adopting a dynamic datastore instead of the static and extensive datastore used in conventional kNN-MT plays a vital role in fast domain adaptation. Its benefits are mainly two-fold: 1) it filters out the majority of the noise in the original datastore in advance and forces the model to attend to the most helpful records, and 2) it limits the nearest neighbor search to an extremely small datastore, thus improving inference efficiency and reducing storage requirements to a large extent. Specifically, for each test sentence x, we leverage an existing search engine (e.g., ElasticSearch) to retrieve the top-64 (default setup) bilingual sentence pairs {(x^i, y^i)}_{i=1}^{64} with the highest relevance scores from the training corpus. We implement the relevance score as BM25 (Robertson & Zaragoza, 2009), a popular and competitive approach to text ranking tasks. Then, in pursuit of selecting better output hypotheses, we employ the following similarity to re-rank the retrieved bilingual sentences and maintain the top-m bilingual reference samples {(x^i, y^i)}_{i=1}^{m}:

sim(x, x^i) = 1 − dist(x, x^i) / max(|x|, |x^i|),   (4)

where dist(·, ·) denotes the edit distance, and x^i, y^i are the retrieved source and target sentences from the training data, respectively. These sentence pairs are utilized to create a small datastore (K_x, V_x) for the source sentence x, as in Equation 1, by forwarding the trimmed corpus through the pre-trained NMT model:

(K_x, V_x) = ⋃_{1≤i≤m} {(f_θ(x^i, y^i_{<t}), y^i_t), ∀ y^i_t ∈ y^i}.   (5)
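For illustration, the re-ranking step can be sketched as follows. BM25 retrieval itself is assumed to come from an external engine such as ElasticSearch, so the sketch starts from a candidate list; the function names are ours, not the paper's code:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ta != tb)))   # substitution
        prev = cur
    return prev[-1]

def rerank(src_tokens, candidates, m=2):
    """Keep the top-m BM25 candidates under the similarity
    sim(x, x_i) = 1 - dist(x, x_i) / max(|x|, |x_i|).

    candidates: list of (source_tokens, target_tokens) pairs from BM25.
    """
    def sim(x_i):
        return 1.0 - edit_distance(src_tokens, x_i) / max(len(src_tokens), len(x_i))
    return sorted(candidates, key=lambda pair: sim(pair[0]), reverse=True)[:m]
```

The m retained pairs are then forward-passed through the NMT model once to produce the per-sentence datastore (K_x, V_x).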

3.2. INFERENCE WITH ADAPTIVE kNN RETRIEVAL

Since we build a significantly smaller datastore from several similar bilingual sentences, it is inevitable that irrelevant sub-units (chunks or words) are introduced into the nearest neighbor search. Under this circumstance, arbitrarily introducing the kNN distribution with a fixed coefficient λ risks interference from the noise in retrieved neighbors. Moreover, it is unnecessary to incorporate the kNN distribution when the pre-trained NMT model has high confidence in predicting the next word; doing so anyway tends to harm performance. To ease the bias of using a fixed interpolation coefficient in kNN-MT, prior work (Zheng et al., 2021a) trains a light-weight network to predict the coefficient λ, which decides the engagement of the kNN module in an adaptive manner. In this work, we greatly simplify this process by determining λ as a linear function of the normalized distance, which is effective but requires no further training. Precisely, we explicitly calculate λ by:

λ = g(d_0) = ReLU(1 − d_0 / τ),   (6)

where τ is the temperature parameter defined in Equation 2 and d_0 represents the top-1 distance during nearest neighbor search. In this way, the kNN distribution is ignored when the distance is larger than the temperature value τ, and its engagement is magnified when records are relevant enough to the current query.
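Because this coefficient needs no learned parameters, it amounts to a few lines of code. The sketch below assumes `d0` is the distance of the best-matching neighbor returned by the search:

```python
def adaptive_lambda(d0, tau):
    """Distance-aware interpolation coefficient lambda = ReLU(1 - d0 / tau).

    d0:  distance of the top-1 retrieved neighbor
    tau: the softmax temperature from Equation 2
    """
    return max(0.0, 1.0 - d0 / tau)

def combine(p_knn, p_nmt, d0, tau):
    """Interpolate the retrieval and NMT distributions adaptively."""
    lam = adaptive_lambda(d0, tau)
    return [lam * pk + (1.0 - lam) * pn for pk, pn in zip(p_knn, p_nmt)]
```

When d0 ≥ τ, λ is 0 and decoding falls back to the plain NMT distribution; as d0 approaches 0, λ approaches 1 and the retrieved neighbors dominate.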

3.3. DISCUSSIONS

To our knowledge, the most relevant work to our study in the kNN-MT area is Fast kNN-MT (Meng et al., 2022), which constructs different datastores for each source sentence by token-level pre-filtering. However, it still requires constructing the entire datastore in advance, resulting in lower decoding and storage efficiency than our SK-MT method, which adopts a dynamic datastore. SK-MT can also be viewed as a novel sentence-level pruning strategy, parallel to previous token-level pruning approaches such as random pruning (Khandelwal et al., 2021), greedy merging (He et al., 2021) and cluster-based pruning (Wang et al., 2022a). Our approach benefits from the fact that the storage overhead of text data is far smaller than that of dense vectors, and thus we provide a practical version of kNN-MT for real applications. Another main advantage of our method is its scalability in terms of reference fetching (owing to advanced text retrieval systems) and translation (owing to the scalability of reference samples). It is effortless to perform operations on a datastore such as insertions, deletions, or substitutions by operating on the reference samples, while the same process is quite costly for a kNN-based method that requires a pre-defined datastore. Interestingly, the retrieval-then-generation paradigm inspired by our kNN-MT analysis also falls into the translation-memory area (Gu et al., 2018; Zhang et al., 2018a; Xia et al., 2019; Cai et al., 2021). Our approach is similar in spirit to the framework of Gu et al. (2018) but introduces kNN retrieval to achieve shallow fusion. In this manner, we inherit the advantage of kNN-MT that no extra training is needed, and leverage sentence-level text retrieval from the translation-memory area to improve kNN-MT's efficiency. We also find that SK-MT achieves better or comparable performance than state-of-the-art translation-memory methods, as shown in Appendix B.

4. EXPERIMENTS

We evaluate the effectiveness of our proposed SK-MT in two settings: 1) domain adaptation, in which a pre-trained general-domain NMT model is used to translate domain-specific sentences with kNN search over an in-domain datastore; 2) online learning from human feedback in human-in-the-loop machine translation, where the pre-trained NMT model incrementally takes human-corrected samples into account during the decoding process.

4.1. DOMAIN ADAPTATION

Dataset and Evaluation. For the domain adaptation task, we use the same multi-domain dataset as the baseline (Khandelwal et al., 2021) and consider IT, Medical, Koran, and Law in our experiments. The statistics of the dataset are shown in Appendix A.1. For performance evaluation, we adopt SacreBLEU (Post, 2018) and ChrF (Popovic, 2015).

Baselines. We compare our method (SK-MT) with several baselines: 1) NMT: we adopt the WMT'19 German-English news translation task winner model (Ng et al., 2019) as the pre-trained NMT model.

Main Results. In Table 2, we verify the performance of SK-MT on the multi-domain dataset. For SK-MT_1, which refers to only m = 2 samples instead of the whole training set as vanilla kNN-MT does, we observe no significant performance degradation in terms of either BLEU or ChrF. Surprisingly, with a larger number of reference samples, SK-MT_2 substantially outperforms vanilla kNN-MT by 1 point of both BLEU and ChrF on average and achieves comparable performance to AK-MT, which is regarded as the state-of-the-art kNN-MT method. It is a remarkable outcome that SK-MT_1 and SK-MT_2 surpass all the efficient methods (FK-MT, EK-MT, and CK-MT) by a large margin, illustrating that our framework attains high effectiveness while being designed for efficient decoding like those three approaches. Additionally, we conduct an ablation study on our adapter with the dynamic parameter strategy, which confirms the benefit of adjusting λ in an adaptive manner. We also carry out experiments in a more general setting, i.e., the WMT'14 news translation task, but find that kNN-based methods do not achieve significant improvements there. We believe that low sentence-level similarity largely accounts for the unfavourable performance on the WMT'14 dataset. More results and discussions are included in Appendix C.

The Effect of Hyper-parameters. The translation quality of kNN-MT is sensitive to its hyper-parameters λ, τ, and k, which are reduced to τ and k only in our SK-MT framework.
The number of reference samples m is also an essential hyper-parameter, determining the richness of the reference samples and the size of the dynamic datastore. Intuitively, increasing m tends to produce better translations but requires more decoding time. Therefore, carefully selecting m is of vital importance, as it balances the effectiveness and efficiency of our proposed method. To achieve a good trade-off, we cap m at 16. With a grid search over the hyper-parameters, as illustrated in Table 4 and Figure 2, our proposed SK-MT_2 achieves the best performance on the IT development set. More details can be found in Appendix A.3. It is worth noticing that SK-MT_1, where k = 1 and m = 2, achieves comparable performance to the best SK-MT_2 with conceptually improved efficiency, which is verified experimentally in the next section.

Decoding Speed and Storage Overhead. In this section, we compare the inference speed and storage overhead with the other baselines. For completeness and fairness of the speed comparison, we test the speeds of all models with various batch sizes: 1, 4, 8, and 16. The hardware we use is 112 cores of an Intel(R) Xeon(R) Gold 6258R CPU and a single GeForce RTX 2080 Ti GPU. As shown in Table 3, our proposed light-weight SK-MT_1 achieves nearly 90% of the decoding speed of the NMT model and achieves higher speed than kNN-MT with GPU acceleration. Not surprisingly, the decoding speed of SK-MT decreases as m increases, because all the reference samples are passed through the NMT model to construct the datastore. In addition, kNN-MT typically takes 37GB of space for its datastore, a storage requirement that SK-MT avoids. Both the time and space efficiency of our method are further amplified given an extensive datastore (e.g., one built from WMT'14).
More details can be found in Appendix C.

Word Accuracy Analysis. It has been observed that directly using general models to translate out-of-domain sentences or rare words can cause severe performance degradation (Koehn & Knowles, 2017; Dou et al., 2019). We suspect kNN-based models alleviate this issue by introducing external information about out-of-domain and rare words into the pre-trained NMT models. To analyze this, we adopt COMPARE-MT (Neubig et al., 2019) to compute the accuracy of words at different frequencies. As illustrated in Figure 3, in terms of word accuracy, kNN-based methods substantially outperform NMT, especially for rare words with frequency ≤ 10. This indicates that the excellent adaptation ability of kNN-based models is partially owing to their richer knowledge of low-frequency words. On average, SK-MT_2 performs slightly better than SK-MT_1, perhaps because the larger number of reference samples covers more rare words in the datastore.

4.2. ONLINE LEARNING

Task Description. We verify the feasibility of applying our proposed approach to machine translation with human feedback, a canonical incremental adaptation task. In this scenario, human experts post-edit translation hypotheses, and in turn the revised translations are fed back to the MT system for adaptation and improvement. One of the greatest challenges for previous work is that such approaches generally require constantly adapting the online models to newly arriving samples, which leads to significant inefficiency in practice. Worse still, they suffer from catastrophic forgetting. For our framework, we follow the research line of non-parametric online learning methods (Wang et al., 2022b), which adopt external knowledge to memorize human feedback.

Dataset and Baselines. For the online learning task, we adopt two widely-used document-level datasets, i.e., the European Medicines Agency (EMEA) dataset (Tiedemann, 2009) and the JRC-Acquis corpus (Steinberger et al., 2006). Following previous work (Wang et al., 2022b), we divide the documents into five buckets based on their lengths (0-50, 50-100, 100-200, 200-500 and 500-1000). We compare our method with several representative approaches, including NMT (Ng et al., 2019), kNN-MT (Khandelwal et al., 2021) and KoK (Wang et al., 2022b). More details of the datasets and implementation are shown in Appendix A.1 and A.2.
With the benefit of an adaptive λ, SK-MT_2 not only exceeds kNN-MT by a large margin as expected, but also achieves equivalent or even better performance than KoK on both the EMEA and JRC-Acquis datasets. Similar to domain adaptation, SK-MT_1 is better than kNN-MT in terms of BLEU and ChrF in the online learning setting. Moreover, the improvements across the various length buckets demonstrate the effectiveness and generalization of our method.

Zero-Shot and Few-Shot Ability. Following Wang et al. (2022b), we evaluate the adaptation speed of different approaches to human feedback using the R-indicator (Simianer et al., 2019), which measures the translation recall of words with different numbers of occurrences in users' feedback. For the j-th bilingual sentence pair (x_j, y_j) ∈ (X, Y) from the corpus, we let R_{i,j} represent the unique words in the reference y_j that occur for the (i+1)-th time in the whole document. We denote by R_i the recall of tokens that have appeared i times in the previously corrected sentences:

R_i = Σ_{j=1}^{N} |H_j ∩ R_{i,j}| / Σ_{j=1}^{N} |R_{i,j}|,

where N denotes the corpus size |(X, Y)| and H_j stands for the unique words in the j-th hypothesis ŷ_j. Specifically, R_0 evaluates tokens that appear for the first time in the document being translated, and R_1 considers those that have appeared once. We conduct experiments on documents in the [50, 100), [100, 200), [200, 500) and [500, 1000) buckets of the EMEA and JRC-Acquis datasets, and compute R_0, R_1, R_{2∼5}, R_{5∼9} and R_{9+}. As shown in Figure 5, in all settings SK-MT shows roughly the same trend as KoK: the R_i values of SK-MT improve rapidly and outperform the NMT and kNN-MT baselines. This indicates that SK-MT and KoK adapt to helpful human feedback faster.
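Under our reading of this definition, the R-indicator can be computed per document as sketched below (a hypothetical helper, counting one occurrence per reference sentence in which a word appears):

```python
from collections import defaultdict

def r_indicator(hypotheses, references):
    """Compute R_i over one document.

    hypotheses, references: aligned lists of token lists.
    Returns {i: R_i}, where R_i is the recall of reference words on the
    (i+1)-th reference sentence in which they occur.
    """
    seen = defaultdict(int)                  # prior occurrences of each word
    hits, total = defaultdict(int), defaultdict(int)
    for hyp, ref in zip(hypotheses, references):
        hyp_words = set(hyp)
        for w in set(ref):                   # unique words in the reference
            i = seen[w]                      # word appeared i times before
            total[i] += 1
            hits[i] += w in hyp_words        # bool counts as 0/1
        for w in set(ref):
            seen[w] += 1
    return {i: hits[i] / total[i] for i in total}
```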

5. RELATED WORK

Domain Adaptation for NMT. Domain adaptation aspires to cope with the discrepancy across domain distributions and improve the generality of trained models. Domain adaptation approaches in the MT field are mainly divided into two categories: 1) architecture-centric approaches, which typically add trainable parameters to the NMT model for adaptation (Wang et al., 2017; Wuebker et al., 2018; Bapna & Firat, 2019; Guo et al., 2021); and 2) data-centric approaches, which fine-tune the NMT model using domain-specific corpora. In the absence of a pre-defined in-domain corpus, the parallel data can be selected from a larger generic corpus (Del et al., 2021), generated from a monolingual corpus through forward- or back-translation (Zhang et al., 2018b; Poncelas & Way, 2019; Wei et al., 2020), or synthesized from a lexicon or a template (Hu et al., 2019; Peng et al., 2020). Recently, non-parametric methods have provided a new paradigm for domain adaptation by retrieving from a datastore of similar instances (Khandelwal et al., 2021; Zheng et al., 2021a; Jiang et al., 2021). We follow this research line and propose a more effective and efficient kNN-based framework.

Nearest Neighbor Translation. Non-parametric approaches that incorporate external knowledge into pre-trained models through a retrieval pipeline have attracted wide attention across natural language processing, including language modeling (Khandelwal et al., 2020), machine translation (Khandelwal et al., 2021; Zheng et al., 2021a; Jiang et al., 2021), question answering (Guu et al., 2020; Lewis et al., 2020; Xiong et al., 2021) and dialogue generation (Fan et al., 2021; Thulke et al., 2021). For the NMT system, Khandelwal et al. (2021) first proposed kNN-MT, a non-parametric approach that plugs a kNN classifier over a large datastore into traditional NMT models (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) to achieve significant improvement. In addition, Zheng et al.
(2021a) propose to dynamically determine the number of retrieved tokens to consider at each step. Martins et al. (2022a) attempt to retrieve chunks of tokens from the datastore instead of a single token. Due to its scalability and adaptability, kNN-MT has also shown great potential for unsupervised domain adaptation (Zheng et al., 2021b; Du et al., 2022) and online learning (Wang et al., 2022b). Despite its great success, kNN-MT suffers from large storage overhead and low decoding speed. Recently, several works have been proposed to promote its practicality. Meng et al. (2022) design Fast kNN-MT to reduce decoding time, constructing different datastores for each source sentence by searching for the neighbors of the source tokens. Martins et al. (2022b) investigate the adaptive retrieval, datastore pruning and dimension reduction proposed in He et al. (2021), and further design a retrieval-distribution cache to speed up decoding. Wang et al. (2022a) adopt a lightweight neural network to reduce the dimension of keys and further leverage cluster-based pruning to reduce retrieval redundancy. Different from previous methods, we propose to construct a dynamic datastore of extremely small size, which avoids kNN search over the entire datastore and thus dramatically improves time and space efficiency.

A EXPERIMENTAL DETAILS AND MORE ANALYSIS

A.1 DATASET STATISTICS

The statistics of the datasets included in the domain adaptation (multi-domain dataset) and online learning (EMEA and JRC-Acquis datasets) tasks are listed in Tables 6 and 7, respectively. The Moses toolkit is used to tokenize the sentences, and the words are split into sub-word units (Sennrich et al., 2016) with the bpe-codes provided by Ng et al. (2019).

A.2 IMPLEMENTATION DETAILS

As for the online learning task, we consider kNN-MT and KoK (Wang et al., 2022b). KoK introduces two k-nearest-neighbor (kNN) modules: Token-kNN memorizes the human feedback, i.e., the correct sentence provided by human translators, while Policy-kNN adaptively balances the usage of the history of human feedback and the original NMT model. To replicate KoK, we follow the setup in Wang et al. (2022b) and set K for both Token-kNN and Policy-kNN to 8. The whole process of the online learning task is as follows: we initialize our reference corpus as empty and incrementally add the corresponding bilingual sentence pair to the corpus after each source sentence is translated. In this manner, we simulate the human-in-the-loop scenario where the translation system can only attend to previously human-corrected sentences. When translating the following sentence, we perform the same text retrieval procedure as in the domain adaptation setting for our method. As for kNN-MT and KoK, we add the corresponding bilingual sentence pair to the datastore after each source sentence is translated, which likewise simulates the online learning scenario.
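The simulation loop described above can be sketched as follows; `translate` and `build_datastore` stand in for the SK-MT decoding and sentence-level retrieval components and are assumptions of this sketch:

```python
def simulate_online_learning(doc_pairs, translate, build_datastore):
    """Human-in-the-loop simulation: the reference corpus starts empty,
    and each gold (source, reference) pair is appended only after its
    source sentence has been translated.

    doc_pairs:       list of (source, human_reference) pairs, in order
    translate:       callable(source, datastore) -> hypothesis
    build_datastore: callable(source, corpus) -> per-sentence datastore
    """
    corpus, hypotheses = [], []
    for src, human_ref in doc_pairs:
        datastore = build_datastore(src, corpus)  # sentence-level retrieval
        hypotheses.append(translate(src, datastore))
        corpus.append((src, human_ref))           # simulated post-editing
    return hypotheses
```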

A.3 HYPER-PARAMETERS SELECTION

We report the results of applying a variety of hyper-parameters to our proposed SK-MT model on the development set. We consider a grid search on τ ∈ {5, 10, 20, 50, 100, 150, 200}, k ∈ {1, 2, 3, 4} and m ∈ {1, 2, ..., 16}. The optimal choices for the multi-domain dataset, including IT, Medical, Koran and Law, are shown in Tables 8 and 9.

Table 10 shows the performance comparison of kNN-based models with the base model fine-tuned on the domain-specific datasets. It can be seen that kNN-based methods do not fully maintain translation quality equivalent to fine-tuned models, underperforming by roughly 1 point of BLEU and ChrF on average. However, their domain adaptation performance is acceptable given that they do not adapt the pre-trained models for better generation. As for the machine translation with human feedback task, we update the pre-trained NMT model online with human-corrected sentences, and the results are illustrated in Table 11. We observe that KoK and SK-MT outperform NMT with online tuning in all length buckets without constant model updating, which verifies the effectiveness and efficiency of involving non-parametric approaches in the online learning scenario, and suggests that high similarity between test inputs and the reference corpus is very helpful in promoting kNN-MT's efficiency. Furthermore, to gain deeper insight into this phenomenon, following Zhang et al. (2018a), we measure the similarity between a test sentence x and the training corpus D_train by computing the sentence similarities between x and the retrieved source sentences:

sim(x, D_train) = max_{x_i ∈ D_train} sim(x, x_i).

The analysis listed in Table 15 demonstrates that most of the test sentences (nearly 75%) have similarities of less than 0.4 to the training set, suggesting that the similarity between the training and test sets of WMT'14 is considerably low. We believe this low similarity largely accounts for the unfavourable performance on the WMT'14 dataset.
Moreover, the different distributions of similarities across the WMT'14, JRC-Acquis and IT datasets can be used to explain the performance discrepancies listed in Tables 2, 13 and 14. Thus, high similarity contributes much to kNN-based methods, which in this respect resemble TM-based methods, and it also gives some insight into the working mechanism of kNN-based methods.
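This similarity analysis is easy to reproduce; the sketch below implements the edit-distance-based sentence similarity and takes the maximum over a set of retrieved training sources (retrieval itself is assumed to be given):

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]

def corpus_similarity(x, retrieved_sources):
    """sim(x, D_train) approximated as the maximum sentence similarity
    between x and the retrieved training source sentences."""
    return max(1.0 - edit_distance(x, x_i) / max(len(x), len(x_i))
               for x_i in retrieved_sources)
```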



Footnotes:
https://github.com/elastic/elasticsearch
https://github.com/mjpost/sacrebleu
In our experiments, the preliminary time spent on text retrieval in SK-MT is negligible.

6. CONCLUSION AND FUTURE WORK

In this paper, we present SK-MT, a simple and scalable nearest neighbor machine translation approach for fast domain adaptation. By constructing a dynamic datastore and introducing a distance-aware adapter for inference, we are able to produce equivalent or even superior performance to previous kNN-based approaches. Moreover, experimental results in the domain adaptation and online learning settings demonstrate that our framework does not require any extra training and is efficient in both decoding time and storage overhead. It is promising that our proposed SK-MT has a wide range of applications not limited to the kNN-MT setting discussed in this paper. In the future, we would like to explore the feasibility of applying SK-MT to other kNN-based methods, such as kNN-LM (Khandelwal et al., 2020), or to other sequence-to-sequence tasks.



In this way, the time and space complexity of kNN-MT are reduced to O(|K x |Bl) and O(|K x |h) respectively, where |K x | ≪ |K|.
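To make the complexity reduction concrete, the toy numpy sketch below (all sizes and dimensions are illustrative assumptions, not the paper's settings) contrasts brute-force kNN search over a full datastore K with search over a small per-sentence datastore K_x — the per-query work is linear in the datastore size:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 64                                    # toy hidden size (the paper uses 1024)
full_K = rng.normal(size=(100_000, h))    # stand-in for the full datastore K
small_K = full_K[:256]                    # stand-in for a dynamic datastore K_x

def knn_search(query, keys, k):
    """Return indices of the k nearest keys by squared L2 distance (O(|keys| * h) work)."""
    d = ((keys - query) ** 2).sum(axis=1)
    return np.argsort(d)[:k]

q = rng.normal(size=h)
# Same interface, but ~400x fewer distance computations per decoding step.
top_full = knn_search(q, full_K, k=2)
top_small = knn_search(q, small_K, k=2)
```

Since |K_x| stays in the hundreds regardless of the size of the reference corpus, the per-step search cost no longer grows with the full datastore.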

as the pre-trained NMT model.
2) kNN-MT: Following Khandelwal et al. (2021), we incorporate the datastore built from the domain datasets into the pre-trained model.
3) AK-MT: A variant of kNN-MT (Zheng et al., 2021a) in which the hyper-parameter k is adaptively selected.
4) FK-MT: Fast kNN-MT (Meng et al., 2022), which constructs a smaller datastore for each source sentence by searching for the nearest token-level neighbors.
5) EK-MT: An efficient kNN-MT proposed by Martins et al. (2022b), which explores several approaches for acceleration.
6) CK-MT: Chunk-based kNN-MT (Martins et al., 2022a), which retrieves chunks of tokens from the datastore instead of a single token.

Implementation Details. In our experiments, we adopt the THUMT (Zhang et al., 2020) toolkit for our model implementation. For each sentence in the test set, we retrieve the 64 sentences with the highest BM25 scores from the training corpus and re-rank them according to the metric defined in Equation 4. We keep the top-m bilingual sentences as our reference corpus, which is used to build a dynamic datastore with the hidden dimension set to 1024. During inference, we carefully tune the hyper-parameters on the development set by performing a grid search on k ∈ {1, 2, 3, 4}, m ∈ {1, 2, 4, 8, 16} and τ ∈ {5, 10, 20, 50, 100, 150, 200}. Based on the validation results, we select two configurations for our experiments: m = 2, k = 1 as SK-MT_1 and m = 16, k = 2 as SK-MT_2, with the temperature τ set to 100 in both. The beam size and length penalty are set to 4 and 0.6 for all datasets. To replicate the other kNN-based baselines, we utilize FAISS (Johnson et al., 2021) for efficient kNN retrieval, learning 4096 cluster centroids and searching 32 clusters for each target token.
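The retrieve-then-rerank pipeline described above can be sketched as follows. This is a hedged illustration: BM25 retrieval is performed with Elasticsearch in the paper, so the retrieved sentence pairs are given directly here; the re-ranking metric (Equation 4 of the paper) is replaced by a simple token-overlap score; and `decoder_states` is a hypothetical stand-in for the pre-trained NMT model's per-token hidden representations:

```python
import numpy as np

def overlap_score(src, cand_src):
    """Illustrative stand-in for the paper's re-ranking metric (Equation 4)."""
    a, b = set(src.split()), set(cand_src.split())
    return len(a & b) / max(len(a), 1)

def build_dynamic_datastore(src, retrieved_pairs, decoder_states, m=2):
    """Re-rank retrieved (source, target) pairs, keep top-m, emit (key, value) entries."""
    ranked = sorted(retrieved_pairs,
                    key=lambda p: overlap_score(src, p[0]), reverse=True)[:m]
    keys, values = [], []
    for cand_src, cand_tgt in ranked:
        for t, token in enumerate(cand_tgt.split()):
            keys.append(decoder_states(cand_src, cand_tgt, t))  # f_theta(x, y_<t)
            values.append(token)
    return np.stack(keys), values

# Toy stand-in for the NMT decoder's hidden state at target position t.
def decoder_states(src, tgt, t, h=8):
    rng = np.random.default_rng(hash((src, tgt, t)) % 2**32)
    return rng.normal(size=h)

pairs = [("install the driver", "installieren Sie den Treiber"),
         ("open the file", "öffnen Sie die Datei")]
keys, values = build_dynamic_datastore("install the new driver", pairs,
                                       decoder_states, m=1)
```

The resulting (key, value) pairs form the small per-sentence datastore that is searched at every decoding step.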

Figure 2: Temperature selection on IT development set.

Figure 3: Word accuracy on multi-domain dataset.

Figure 4: An illustration of the scenario of machine translation with human feedback.

Figure 5: Results of R-indicator on documents with [50,100), [100,200), [200,500), and [500,1000) buckets from EMEA and JRC-Acquis.

Inference with kNN Retrieval. At the t-th decoding step, given the already generated words ŷ_<t, the current context representation f_θ(x, ŷ_<t) is leveraged to generate a retrieval distribution p_kNN(y_t | x, ŷ_<t) over the entire vocabulary:

p_kNN(y_t | x, ŷ_<t) ∝ Σ_{(k_i, v_i) ∈ N_t} 1(y_t = v_i) · exp(−d(k_i, f_θ(x, ŷ_<t)) / τ),

where N_t denotes the set of k nearest neighbors retrieved from the datastore, d(·, ·) is the L2 distance, and τ is the softmax temperature.
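As a concrete sketch of this retrieval distribution, following the formulation of Khandelwal et al. (2021) — the datastore, vocabulary size, and all numeric values below are toy assumptions:

```python
import numpy as np

def p_knn(query, keys, values, vocab_size, tau=100.0, k=2):
    """p_kNN(y_t | x, y_<t): neighbors vote for their value tokens with softmax(-d/tau) weight."""
    d = ((keys - query) ** 2).sum(axis=1)   # squared L2 distance to every key
    nn = np.argsort(d)[:k]                  # indices of the k nearest neighbors
    w = np.exp(-d[nn] / tau)
    w /= w.sum()                            # normalize: softmax over -d/tau
    p = np.zeros(vocab_size)
    for idx, weight in zip(nn, w):
        p[values[idx]] += weight            # aggregate weights by value token id
    return p

keys = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
values = [3, 3, 7]                          # target token id stored with each key
p = p_knn(np.array([0.1, 0.0]), keys, values, vocab_size=10, k=2)
```

Since both nearest neighbors in this toy datastore store the same token id, all retrieval mass lands on that token.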

Grid search on m and k on IT development set with the temperature τ fixed to 100. The * marks the two selected models (SK-MT 1 and SK-MT 2 ) in our experiments.

BLEU(↑) and ChrF(↑) on multi-domain test sets, including IT, Medical, Koran and Law.

Storage overhead and inference speed on Law test set.

BLEU(↑) and ChrF(↑) on EMEA and JRC-Acquis datasets.

The statistics of multi-domain dataset.

The statistics of EMEA and JRC-Acquis datasets for online learning.

Grid search on temperature τ with k fixed to 2 and m fixed to 16.

Grid search on m and k with temperature τ fixed to 100.

BLEU(↑) and ChrF(↑) on multi-domain test sets, including IT, Medical, Koran and Law.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful feedback. We appreciate Lemao Liu and Hongkun Hao for the fruitful discussions and dataset sharing. This work was done when the first author was an intern at Tencent AI Lab and supported by the grants from National Natural Science Foundation of China (No.62222213, 62072423), and the USTC Research Funds of the Double First-Class Initiative (No.YD2150002009). Tong Xu is the corresponding author.

APPENDIX

A EFFECT OF REFERENCE CORPUS SCALE

We investigate the discrepancies brought by different volumes of reference corpus for text retrieval on the Law dataset; the detailed results are shown in Figure 6. Specifically, we randomly sample the training corpus at ratios of (0.2, 0.4, 0.6, 0.8, 1.0) to form our reference corpus for quick experiments, and the same sampled corpus is used to build a datastore for vanilla kNN-MT. These results demonstrate that translation quality improves steadily with corpus scale, which verifies the effectiveness of leveraging external information in NMT. SK-MT with m = 16 achieves the highest BLEU score among the SK-MT methods and is competitive with the state-of-the-art AK-MT model across corpora of different scales (within 1 point on average). It is noteworthy that the performance gain between m = 8 and m = 16 gradually shrinks, suggesting that extensive reference samples are unnecessary once rich external information is already available.

B COMPARISON WITH TRANSLATION-MEMORY METHODS

In order to make a comparison with previous translation memory (TM) approaches (Gu et al., 2018; Zhang et al., 2018a; Xia et al., 2019; Cai et al., 2021), we follow their experimental setup and evaluate translation performance on the JRC-Acquis corpus (Steinberger et al., 2006). We obtain the datasets originally preprocessed by Gu et al. (2018) and carry out translation experiments on four translation directions, i.e., German-English (De⇔En) and Spanish-English (Es⇔En). The statistics of the datasets are shown in Table 12. We adopt the same Transformer structure as Xia et al. (2019), which contains a 6-layer Transformer encoder and a 6-layer Transformer decoder. The input dimension, FFN layer dimension and number of attention heads are set to 512, 2048 and 8, respectively. As shown in Table 13, our proposed approach achieves significant improvements or comparable performance relative to previous TM-based methods, which also indicates the effectiveness of kNN retrieval for translation memory fusion.

Published as a conference paper at ICLR 2023

C EXPERIMENTS ON WMT'14 DATASET

In this section, we carry out experiments on the WMT'14 English-German dataset, consisting of 4.5M sentence pairs for model training, and consider bidirectional translation in this setting. We learn joint BPE codes with 45K types and adopt the Moses toolkit to tokenize the sentences and split the words into sub-word units. For each translation direction, we train a separate Transformer-based model and select the best model on the development set, a 1% split of the full set of sentence pairs. We adopt SacreBLEU (Post, 2018) and ChrF (Popovic, 2015) for performance evaluation and report final results on newstest2014, which is composed of 3003 bilingual sentence pairs. As listed in Table 14, we find that kNN-based methods show little power in boosting translation quality on the WMT'14 translation task.
We suspect that this is partially because the pre-trained model was trained on the same reference corpus, which dramatically limits the richness of the external information. Regardless, these results show that SK-MT has much smaller latency than vanilla kNN-MT when the datastore is large, which reveals that constructing a dynamic datastore is very helpful in promoting kNN-MT's efficiency.

