FEDERATED NEAREST NEIGHBOR MACHINE TRANSLATION

Abstract

To protect user privacy and meet legal regulations, federated learning (FL) is attracting significant attention. Training neural machine translation (NMT) models with traditional FL algorithms (e.g., FedAvg) typically relies on multi-round model-based interactions. However, this is impractical and inefficient for translation tasks due to the vast communication overhead and heavy synchronization. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework that, instead of multi-round model-based interactions, leverages one-round memorization-based interaction to share knowledge across different clients and build low-overhead privacy-preserving systems. Our approach equips the public NMT model trained on large-scale accessible data with a k-nearest-neighbor (kNN) classifier and integrates the external datastore constructed from the private text data of all clients to form the final FL model. A two-phase datastore encryption strategy is introduced to preserve privacy during this process. Extensive experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg, while maintaining promising translation performance in different FL settings.

1. INTRODUCTION

In recent years, neural machine translation (NMT) has significantly improved translation quality (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) and has been widely adopted in many commercial systems. The current mainstream system is first built on a large-scale corpus collected by the service provider and then directly applied to translation tasks for different users and enterprises. However, this application paradigm faces two critical challenges in practice. On the one hand, previous works have shown that NMT models perform poorly in specific scenarios, especially when they are trained on corpora from very different domains (Koehn & Knowles, 2017; Chu & Wang, 2018). Fine-tuning is a popular way to mitigate the effect of domain drift, but it brings additional model deployment overhead and, in particular, requires high-quality in-domain data provided by users or enterprises. On the other hand, some users and enterprises impose strict data-security requirements due to business concerns or government regulations (e.g., GDPR and CCPA), meaning that we cannot directly access private user data for model training. Thus, the conventional centralized-training manner is infeasible in these scenarios. In response to this dilemma, a natural way is to leverage federated learning (FL) (Li et al., 2019), which enables different data owners to train a global model in a distributed manner while leaving raw private data isolated to preserve data privacy. Generally, a standard FL workflow, such as FedAvg (McMahan et al., 2017), contains multi-round model-based interactions between server and clients. At each round, each client first performs training on its local sensitive data and sends the model update to the server. The server aggregates these local updates to build an improved global model.
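The FedAvg aggregation step described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code; the function name and the flat-weight-vector representation are assumptions for brevity.

```python
import numpy as np

def fedavg_round(client_updates, client_sizes):
    """One FedAvg aggregation step (McMahan et al., 2017): average the
    clients' locally trained weights, weighted by local dataset size."""
    total = sum(client_sizes)
    new_weights = np.zeros_like(client_updates[0])
    for w, n in zip(client_updates, client_sizes):
        new_weights += (n / total) * w  # size-weighted contribution
    return new_weights

# Toy example: two clients with scalar "models".
updates = [np.array([1.0]), np.array([3.0])]
sizes = [100, 300]
print(fedavg_round(updates, sizes))  # weighted mean: [2.5]
```

Note that every such round ships a full set of model weights in both directions, which is exactly the communication cost FedNN is designed to avoid.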
This straightforward idea has been implemented by prior works (Roosta et al., 2021; Passban et al., 2022) that directly apply FedAvg to machine translation tasks and introduce parameter pruning strategies during node communication. Despite this, multi-round model-based interactions remain impractical and inefficient for NMT applications. Current models heavily rely on deep neural networks as the backbone, and their parameters can reach tens or even hundreds of millions, bringing vast computation and communication overhead. In real-world scenarios, different clients (i.e., users and enterprises) usually have limited computation and communication capabilities, making it difficult to meet the frequent model training and node communication requirements of the standard FL workflow. Further, due to the capability differences between clients, heavy synchronization also hinders the efficacy of the FL workflow. Reducing the number of interactions may ease this problem but suffers from significant performance loss. Inspired by the recent remarkable performance of memorization-augmented techniques (e.g., the k-nearest neighbor, kNN) in natural language processing (Khandelwal et al., 2020; 2021; Zheng et al., 2021a; b) and computer vision (Papernot & Mcdaniel, 2018; Orhan, 2018), we take a new perspective on the above federated NMT training problem. In this paper, we propose a novel Federated Nearest Neighbor (FedNN) machine translation framework, which equips the public NMT model trained on large-scale accessible data with a kNN classifier and integrates the external datastore constructed from the private data of all clients to form the final FL model. In this way, we replace the multi-round model-based interactions of the conventional FL paradigm with a one-round encrypted memorization-based interaction to share knowledge among different clients and drastically reduce computation and communication overhead. Specifically, FedNN follows a similar server-client architecture.
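The kNN-augmented decoding that FedNN builds on follows kNN-MT (Khandelwal et al., 2021): at each decoding step, the decoder's hidden state queries a datastore of (key, value) pairs, and the retrieved distribution is interpolated with the NMT model's softmax. The sketch below is illustrative; the function name and hyperparameter defaults (`k`, `lam`, `temp`) are assumptions, not the paper's exact configuration.

```python
import numpy as np

def knn_mt_prob(p_nmt, query, keys, values, vocab_size,
                k=2, lam=0.5, temp=10.0):
    """Interpolate the NMT distribution with a kNN distribution built
    from the k datastore keys nearest to the decoder state `query`.
    keys: (N, d) hidden states; values: (N,) target-token ids."""
    d = np.linalg.norm(keys - query, axis=1) ** 2   # squared L2 distances
    idx = np.argsort(d)[:k]                         # k nearest neighbors
    w = np.exp(-d[idx] / temp)                      # softmax over -d / T
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for token, weight in zip(values[idx], w):
        p_knn[token] += weight                      # aggregate per token
    return lam * p_knn + (1 - lam) * p_nmt
```

Because the datastore is a flat table of vectors and token ids, sharing it requires only one round of communication, in contrast to the repeated weight exchanges of FedAvg.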
The server holds large-scale accessible data to construct the public NMT model for all clients, while each client leverages its local private data to build an external datastore, which is collected to augment the public NMT model via kNN retrieval. Based on this architecture, the key is to merge and broadcast all datastores built by different clients while avoiding privacy leakage. We design a two-phase datastore encryption strategy that adopts an adversarial mode between server and clients to preserve privacy during the memorization-based interaction process. On the one hand, the server builds a (K, V)-encryption model for clients to increase the difficulty of reconstructing private text from the datastores constructed by other clients. The K-encryption model is coupled with the public NMT model to ensure the correctness of kNN retrieval. On the other hand, all clients use a shared content-encryption model on their local datastores during the collecting process so that the server cannot directly access the original datastores. During inference, each client leverages the corresponding content-decryption model to obtain the final integrated datastore. We set up several FL scenarios (i.e., Non-IID and IID settings) with a multi-domain English-German (En-De) translation dataset, and demonstrate that FedNN not only drastically decreases computation and communication costs compared with FedAvg, but also achieves state-of-the-art translation performance in the Non-IID setting. Additional experiments verify that FedNN easily scales to large numbers of clients with sparse data, thanks to the memorization-based interaction across different clients. Our code is open-sourced at https://github.com/duyichao/FedNN-MT.
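The requirement that key encryption must not break kNN retrieval can be made concrete with a toy example (this is an illustration of the general principle, not necessarily the paper's exact K-encryption construction): any distance-preserving transform of the keys, such as a shared random orthogonal rotation, leaves the L2 neighbor ranking unchanged while hiding the original vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 100
keys = rng.normal(size=(n, d))   # toy datastore keys (hidden states)
query = rng.normal(size=d)       # toy decoder-state query

# A random orthogonal matrix acts as a toy key "encryption":
# it preserves all pairwise L2 distances.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
enc_keys, enc_query = keys @ Q, query @ Q

nn_plain = np.argsort(np.linalg.norm(keys - query, axis=1))[:5]
nn_enc = np.argsort(np.linalg.norm(enc_keys - enc_query, axis=1))[:5]
print(np.array_equal(nn_plain, nn_enc))  # same neighbors before/after
```

Since ||(x - y)Q|| = ||x - y|| for orthogonal Q, retrieval over encrypted keys returns exactly the same neighbors, so the kNN classifier still works without exposing the raw representations.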

2. FEDNMT: FEDERATED NEURAL MACHINE TRANSLATION

Current commercial NMT systems are built on a large-scale corpus collected by the service provider and directly applied to different users and enterprises. However, this mode can hardly satisfy the model-customization and privacy-protection requirements of users and enterprises. In this work, we focus on a more general application scenario, where users and enterprises collaboratively train NMT models with the service provider, but the service provider cannot directly access the private data. Formally, this application scenario consists of $|C|$ clients (i.e., users or enterprises) and a central server (i.e., the service provider). The central server holds vast accessible translation data $D_s = \{(x^i_s, y^i_s)\}_{i=1}^{|D_s|}$, where $x^i = (x^i_1, x^i_2, \ldots, x^i_{|x^i|})$ and $y^i = (y^i_1, y^i_2, \ldots, y^i_{|y^i|})$ (for brevity, we omit the subscript $s$ here) are text sequences in the source and target languages, respectively. The central server can easily train a public NMT

