FEDLITE: IMPROVING COMMUNICATION EFFICIENCY IN FEDERATED SPLIT LEARNING

Abstract

In classical federated learning, clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a (small) part of the model is stored and trained on the clients while the remaining (large) part stays only at the server. Unfortunately, the model partitioning employed in split learning significantly increases the communication cost compared to classical federated learning algorithms. This paper addresses this issue by compressing the additional communication associated with split learning via a novel clustering algorithm and a gradient correction technique. An extensive empirical evaluation on standard image and text benchmarks shows that the proposed method can achieve up to a 490× reduction in communication cost with a minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.

1. INTRODUCTION

Federated learning (FL) is an emerging field that collaboratively trains machine learning models on decentralized data (Li et al., 2019; Kairouz et al., 2019; Wang et al., 2021). One major advantage of FL is that it does not require clients to upload their data, which may contain sensitive personal information. Instead, clients separately train local models on their private datasets, and the resulting locally trained model parameters are infrequently synchronized with the help of a coordinating server (McMahan et al., 2017). While the FL framework helps alleviate data-privacy concerns for distributed training, most existing FL algorithms critically assume that the clients have enough compute and storage resources to perform local updates on the entire machine learning model. However, this assumption does not necessarily hold in many modern applications. For example, classification problems with an extremely large number of classes (often in the millions or billions) commonly arise in the context of recommender systems (Covington et al., 2016), information retrieval (Agrawal et al., 2013), and language modeling (Levy & Goldberg, 2014). Here, the classification layer of a neural network is itself large enough that a typical FL client, e.g., a mobile or IoT device, cannot even store and locally update this single layer, let alone the entire neural network.

Split learning (SL) is a recently proposed technique (Vepakomma et al., 2018; Thapa et al., 2022) that naturally addresses the above issue of FL. It splits the underlying model between the clients and the server such that the first few layers are shared across the clients and the server, while the remaining layers are stored only at the server. The reduction in resource requirements at the clients is particularly pronounced when the last few dense layers constitute a large portion of the entire model.
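To make the setting concrete, the following minimal sketch (our own illustration with hypothetical layer sizes, not code from the paper) runs one step of split training with a small linear client-side extractor and a large server-side softmax classifier. The activations uploaded and the gradients downloaded at the cut layer are the extra per-step messages that split learning introduces.

```python
import numpy as np

# Illustrative sizes: a small client-side feature extractor and a large
# server-side classification layer, mirroring the setting where most
# parameters live on the server.
rng = np.random.default_rng(0)
batch, d_in, d_cut, n_classes = 8, 32, 16, 1000
W_client = rng.normal(0, 0.1, (d_in, d_cut))       # stored on the client
W_server = rng.normal(0, 0.1, (d_cut, n_classes))  # stored on the server

x = rng.normal(size=(batch, d_in))                 # private client data
y = rng.integers(0, n_classes, size=batch)         # labels, here on the server

# --- client: forward to the cut layer; upload activations (extra cost) ---
h = x @ W_client                                   # (batch, d_cut) uploaded

# --- server: finish the forward pass and compute softmax cross-entropy ---
logits = h @ W_server
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
loss = -np.log(p[np.arange(batch), y]).mean()

# --- server: backward pass; return gradient w.r.t. cut-layer activations ---
dlogits = p.copy()
dlogits[np.arange(batch), y] -= 1.0                # d(loss)/d(logits)
dlogits /= batch
grad_h = dlogits @ W_server.T                      # (batch, d_cut) downloaded
W_server -= 0.1 * (h.T @ dlogits)                  # server-side update

# --- client: backpropagate through its own layers ---
W_client -= 0.1 * (x.T @ grad_h)

# Per-step message size is proportional to batch * d_cut in both directions.
print(h.shape, grad_h.shape)
```

Note that both messages scale with the mini-batch size and the cut-layer width, which is exactly the cost the paper targets.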
For instance, in a common convolutional neural network (Krizhevsky, 2014), the last two fully connected layers contain 95% of the parameters of the entire model. In this case, if we allocate the last two layers to the server, then the client-side memory usage can be reduced by 20×. Nonetheless, one major limitation of SL is that the underlying model partitioning leads to an increased communication cost for the resulting framework. Specifically, to train the split neural network, the activations and gradients at the layer where the model is split (referred to as the cut layer) need to be communicated between the server and clients at each iteration. The additional message size is proportional to both the mini-batch size and the activation size. As a result, the communication cost for model training can become prohibitive whenever the mini-batch size and the activation size of the cut layer are large (e.g., in one of our experiments, the additional message size can be 10 times larger than that of the client-side model).

In this paper, we aim to make split neural network training communication-efficient so as to enable its widespread adoption for FL in resource-constrained settings. Our proposed solution is based on the critical observation that, given a mini-batch of data, the client does not need to communicate per-example activation vectors if the activation vectors (at the cut layer) for different examples in the mini-batch exhibit enough similarity. Thus, we propose a training framework that clusters the activation vectors and only communicates the cluster centroids to the server. Interestingly, this is equivalent to adding a vector quantization layer in the middle of the split neural network (see Figure 1 for an overview of our proposed method).

Our main contributions are as follows.

• We propose a communication-efficient method for federated split learning (cf. Section 4). The approach employs a novel compression scheme that leverages product quantization to effectively compress the activations communicated between the clients and the server.

• After applying the activation quantization, the clients can only receive noisy gradients from the server.
The resulting inaccurate client-side model updates lead to significant accuracy drops in our experiments. To mitigate this problem, we propose a gradient correction scheme for the backward pass, which plays a critical role in achieving a high compression ratio with minimal accuracy loss.

• We empirically evaluate the performance of our approach on three standard FL datasets (cf. Section 5). Remarkably, we show that our approach allows for up to 490× communication reduction without significant accuracy loss.

• We present a convergence analysis for the proposed method (cf. Section 4.3), which reveals a trade-off between communication reduction and convergence speed. The analysis further helps explain why the proposed gradient correction technique is beneficial.

We note that our proposed method has potential applications beyond federated split learning, as it can be applied in any learning framework that can benefit from a quantization layer. For instance, it can be used to reduce the communication overhead in two-party vertical FL (Romanini et al., 2021), where the model is naturally split across two institutions and the data labels are generated on the server. It is also worth mentioning that the proposed method does not expose any additional client-side information to the server beyond a vanilla split neural network training approach such as SPLITFED (Thapa et al., 2022). Thus, our method can also leverage existing privacy-preserving mechanisms such as differential privacy (Wei et al., 2020) to provide formal privacy guarantees, though this is orthogonal to our work.
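Why the backward pass needs correcting can be illustrated with a generic example. The quantizer has zero gradient almost everywhere, so the naive approach passes the server's gradient (computed at the quantized activations) straight through to the client. The sketch below contrasts this straight-through estimator with a VQ-VAE-style corrected gradient that adds a commitment term pulling activations toward their assigned centroids. This is a standard, well-known correction used purely for illustration under our own assumed sizes; it is not FedLite's actual correction scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d, k = 8, 4, 3
H = rng.normal(size=(batch, d))      # true cut-layer activations on the client
C = rng.normal(size=(k, d))          # centroids (codebook), for illustration

# Quantize: each activation is replaced by its nearest centroid; the server
# only ever sees H_hat, so its gradient is taken w.r.t. H_hat, not H.
assign = ((H[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
H_hat = C[assign]

g_server = rng.normal(size=(batch, d))  # stand-in for the gradient returned
                                        # by the server w.r.t. H_hat

# Naive straight-through estimator: pretend dH_hat/dH = I and reuse the
# server gradient unchanged, ignoring the quantization error entirely.
g_naive = g_server

# Generic corrected variant (VQ-VAE-style commitment term): also descend on
# 0.5 * beta * ||H - H_hat||^2, which shrinks the quantization error and thus
# the mismatch between the received gradient and the true activations.
beta = 0.25
g_corrected = g_server + beta * (H - H_hat)

print(g_naive.shape, g_corrected.shape)
```

The two gradients differ exactly by the scaled quantization residual, which is why shrinking that residual makes the straight-through approximation less noisy.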



Figure 1: Overview of the proposed algorithm: FEDLITE. In order to reduce the additional communication between the clients and the server, we propose to cluster similar client activations and send the cluster centroids to the server instead. This is equivalent to adding a vector quantization layer in split neural network training. The success of our method relies on a novel variant of product quantizer and a gradient correction technique. We refer the reader to Section 4 for further details.
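The clustering step in the figure can be sketched generically as follows. This is plain product quantization with per-subspace k-means under our own illustrative sizes (the paper's quantizer is a novel variant, so this only conveys the general mechanism): each activation is split into m subvectors, each subspace is clustered into k centroids, and only the codebooks plus per-example centroid indices are transmitted.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d = 64, 16
m, k = 4, 8                        # m subvectors per example, k centroids each
H = rng.normal(size=(batch, d))    # cut-layer activations to be uploaded

sub = d // m
codes = np.empty((batch, m), dtype=np.int64)
codebooks = np.empty((m, k, sub))
for j in range(m):
    X = H[:, j * sub:(j + 1) * sub]
    C = X[rng.choice(batch, k, replace=False)]  # initialize centroids
    for _ in range(10):                         # a few Lloyd iterations
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(k):
            if (assign == c).any():
                C[c] = X[assign == c].mean(0)
    codes[:, j] = assign
    codebooks[j] = C

# Server-side decoding: replace each subvector by its assigned centroid.
H_hat = np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)

# Communication: m*k*sub codebook floats plus batch*m small indices,
# instead of batch*d floats for the raw activations.
raw_bits = batch * d * 32
pq_bits = m * k * sub * 32 + batch * m * int(np.ceil(np.log2(k)))
print(H_hat.shape, round(raw_bits / pq_bits, 2))
```

Because the index cost grows only logarithmically in k, the compression ratio improves as the batch grows while the codebook cost is amortized across all examples.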

