CONVERGENCE ANALYSIS OF SPLIT LEARNING ON NON-IID DATA

Anonymous authors
Paper under double-blind review

Abstract

Split Learning (SL) is a promising variant of Federated Learning (FL), in which the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks a rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an O(1/√T) convergence rate for non-convex objectives on non-IID data, where T is the total number of iterations. The derived convergence results facilitate understanding the effect of crucial factors in SL (e.g., data heterogeneity and local update steps). Comparing with the convergence result of FL, we show that the guarantee of SL is worse than that of FL in terms of training rounds on non-IID data. Experimental results verify our theory. Some generalized conclusions on the comparison between FL and SL in cross-device settings are also reported.

1. INTRODUCTION

Federated Learning (FL) is a popular distributed learning paradigm where multiple clients collaborate to train a global model under the orchestration of one central server. FL (McMahan et al., 2017) covers two settings: (i) cross-silo, where clients are organizations and the number of clients is typically less than 100, and (ii) cross-device, where clients are IoT devices and the number of clients can be up to 10^10 (Kairouz et al., 2021). To alleviate the computation bottleneck at resource-constrained IoT devices in the cross-device scenario, Split Learning (SL) (Gupta & Raskar, 2018; Vepakomma et al., 2018) splits the AI model so that it is trained at the clients and the server separately. The computation-intensive portions are typically offloaded to the server, which is critical for model training on resource-constrained devices. SL is regarded as one of the enabling technologies for edge intelligence in future networks (Zhou et al., 2019).

Comparisons of FL and SL are of practical interest for the design and deployment of intelligent networks. Existing studies focus on various aspects of these comparisons (Thapa et al., 2020; Gao et al., 2020; 2021). Convergence analysis is critical for the performance comparison between SL and FL. Specifically, a rigorous analysis is of paramount importance for the vital research questions raised by Gao et al. (2020) (which are only empirically evaluated but remain unsolved in theory): RQ1, "What factors affect SL performance?", and RQ2, "In which settings will SL outperform FL?". A wealth of work has analyzed the convergence of FL in the cases of IID (Stich, 2018; Zhou & Cong, 2017; Khaled et al., 2020), non-IID (Li et al., 2020; 2019; Khaled et al., 2020; Karimireddy et al., 2020) and unbalanced data (Wang et al., 2020). However, owing to its distinct update process, the convergence analysis of SL on non-IID data has yet to be established.
To this end, this paper first derives rigorous convergence results for SL and then draws theoretical comparisons between FL and SL.

Main contributions. The main contributions can be summarized with respect to the two research questions above:

• We prove the convergence of SL on non-IID data under the standard assumptions used in the FL literature [1], with a convergence rate of O(1/√T), in Section 4.2. From this result, we find that the convergence of SL is affected by factors such as data heterogeneity and the number of local update steps. Experimental results verify the analysis empirically in Section 5.2. To the best of our knowledge, this work is the first to give a convergence analysis of SL on non-IID data.

• We compare FL [2] and SL in theory (Section 4.3) and in practice (Section 5.3). Theoretically, the guarantee of SL is worse than that of FL in terms of training rounds on non-IID data. Empirically, we provide some generalized conclusions on FL and SL in cross-device settings, including: (i) the best and threshold learning rates of SL are smaller than those of FL; (ii) the performance of SL is worse than FL when the number of local update steps is large on highly non-IID data; (iii) the performance of SL can be better than FL when the number of local update steps is small on highly non-IID data.

2. PRELIMINARIES AND ALGORITHM OF SPLIT LEARNING

As two of the most popular distributed learning frameworks, both FL and SL aim to train a global model from distributed datasets. The optimization problem of FL and SL with N clients can be given by

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^{N} p_i f_i(x),$$

where $p_i = n_i/n$ is the ratio of local samples at client $i$ ($n$ and $n_i$ are the sizes of the overall dataset $\mathcal{D}$ and the local dataset $\mathcal{D}_i$ at client $i$, respectively), $x$ is the model parameters, $f(x)$ is the global objective, and $f_i(x)$ denotes the local objective function on client $i$. In particular, $f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f_i(x; \xi_i)\right] = \frac{1}{n_i} \sum_{\xi_i \in \mathcal{D}_i} f_i(x; \xi_i)$, where $\xi_i$ represents a data sample from the local dataset $\mathcal{D}_i$.

SL with the global learning rate. The relay-based (sequential) training process across clients makes SL significantly different from FL; it is described in Algorithm 1. Considering the massive number of clients in the cross-device setting, only a subset S of the clients is selected for model training at each round. The update order of the selected clients can be meticulously designed or randomly determined (the latter is used in this paper). The i-th client requests and initializes with the latest model (step 4) and then performs multiple local updates (steps 5-11) [3]. After K local updates, the client sends the model parameters to the next client (i.e., the (i+1)-th client). The local update
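To make the relay-based update concrete, the following is a minimal Python sketch of SL's sequential training loop on a toy least-squares problem. This is an illustration under simplifying assumptions, not the paper's Algorithm 1: the client-server model split is abstracted away (each of the K local steps is computed as a single gradient step on the full local model), and the data, dimensions, and learning rate are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, K, rounds, lr = 3, 5, 4, 50, 0.05

# Toy non-IID data: each client's labels come from a client-specific
# perturbation of a shared linear model (this is the data heterogeneity).
w_true = rng.normal(size=d)
data = []
for i in range(n_clients):
    X = rng.normal(size=(30, d))
    y = X @ (w_true + 0.3 * rng.normal(size=d))
    data.append((X, y))

def local_grad(w, X, y):
    """Gradient of the local objective f_i(w) = ||Xw - y||^2 / (2 n_i)."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)  # the "latest model" that is relayed between clients
for r in range(rounds):
    order = rng.permutation(n_clients)   # random relay order of selected clients
    for i in order:                      # sequential (relay-based) updates
        X, y = data[i]
        for _ in range(K):               # K local update steps per client
            w -= lr * local_grad(w, X, y)
        # after K steps, w is passed on to the next client in the order

# Global objective f(w) = sum_i p_i f_i(w); here all p_i are equal.
f = np.mean([np.mean((X @ w - y) ** 2) / 2 for X, y in data])
```

In an actual SL deployment each gradient step inside the inner loop would itself be split between client and server at the cut layer; for the convergence analysis, what matters is the sequential relay structure sketched here, in contrast to FL's parallel local updates followed by averaging.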



[1] We only show the convergence for non-convex objective functions here, since SL is now often used for large deep learning models whose objective functions are possibly non-convex. Nevertheless, similar methods can be used to obtain the convergence for general convex and strongly convex functions.

[2] FedAvg is used for comparison in this work.

[3] The client and the server cooperate to conduct the local updates. Note that in SL, though the model update requires communication between the client and the server, the model is still trained on the local dataset, so the process is called a local update (Thapa et al., 2020). The concatenation of the client-side and server-side models after each local update is called the local model.



Figure 1: Illustration of the model updates of FL and SL for 2 clients and 2 local update steps during one round.

