CONVERGENCE ANALYSIS OF SPLIT LEARNING ON NON-IID DATA Anonymous authors Paper under double-blind review

Abstract

Split Learning (SL) is one promising variant of Federated Learning (FL), where the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an $O(1/\sqrt{T})$ convergence rate for non-convex objectives on non-IID data, where $T$ is the total number of iterations. The derived convergence results facilitate understanding the effect of some crucial factors in SL (e.g., data heterogeneity and local update steps). Compared with the convergence result of FL, we show that the guarantee of SL is worse than that of FL in terms of training rounds on non-IID data. The experimental results verify our theory. Some generalized conclusions on the comparison between FL and SL in cross-device settings are also reported.

1. INTRODUCTION

Federated Learning (FL) is a popular distributed learning paradigm where multiple clients collaborate to train a global model under the orchestration of one central server. There are two settings in FL (McMahan et al., 2017): (i) cross-silo, where clients are organizations and the client number is typically less than 100, and (ii) cross-device, where clients are IoT devices and the client number can be up to $10^{10}$ (Kairouz et al., 2021). To alleviate the computation bottleneck at resource-constrained IoT devices in the cross-device scenario, Split Learning (SL) (Gupta & Raskar, 2018; Vepakomma et al., 2018) splits the AI model to be trained at the clients and server separately. The computation-intensive portions are typically offloaded to the server, which is critical for model training at resource-constrained devices. SL is regarded as one of the enabling technologies for edge intelligence in future networks (Zhou et al., 2019).

The comparisons of FL and SL are of practical interest for the design and deployment of intelligent networks. Existing studies compare them from various aspects (Thapa et al., 2020; Gao et al., 2020; 2021), e.g., in terms of learning performance (Gupta & Raskar, 2018), computation efficiency (Vepakomma et al., 2018), communication overhead (Singh et al., 2019), and privacy issues (Thapa et al., 2021). For example, with the emphasis on the learning performance comparison, Gao et al. (2020; 2021) find that SL exhibits (i) faster convergence than FL under IID data in terms of communication rounds; (ii) better learning performance under imbalanced data; (iii) worse learning performance under (extreme) non-IID data, etc. The difference arises from the distinct model update processes of FL and SL. In particular, FL averages the local model parameters at the end of each round, whereas SL trains the clients in sequence and does not average the client updates.
Figure 1 plots the client drift (Karimireddy et al., 2020; Wang et al., 2020; Li et al., 2022) of FL and SL under both IID and non-IID data to visualize the update process. Under the IID setting, SL approaches the global optimum $x^*$ faster than FL owing to the sequential training mechanism. In contrast, under the non-IID setting, SL may deviate from the global optimum for the same reason.

Convergence analysis is critical for the performance comparison between SL and FL. Specifically, a rigorous analysis is of paramount importance for the vital research questions raised by Gao et al. (2020) (which are only empirically evaluated and remain unsolved in theory): RQ1, "What factors affect SL performance?" and RQ2, "In which setting will the SL performance outperform FL?". A wealth of work has analyzed the convergence of FL in the cases of IID (Stich, 2018; Zhou & Cong, 2017; Khaled et al., 2020), non-IID (Li et al., 2020; 2019; Khaled et al., 2020; Karimireddy et al., 2020) and unbalanced data (Wang et al., 2020). However, owing to its distinct update process, the convergence of SL on non-IID data has yet to be analyzed. To this end, this paper first derives rigorous convergence results for SL and then compares FL and SL theoretically.

Main contributions. The main contributions can be summarized with respect to the two research questions above:

• We prove the convergence of SL on non-IID data under the standard assumptions used in the FL literature, with a convergence rate of $O(1/\sqrt{T})$ (Section 4.2). From this result, we find that the convergence of SL is affected by factors such as data heterogeneity and the number of local update steps. Experimental results verify the analysis empirically in Section 5.2. To the best of our knowledge, this work is the first to give a convergence analysis of SL on non-IID data.

• We compare FL and SL in theory (Section 4.3) and in practice (Section 5.3). Theoretically, the guarantee of SL is worse than that of FL in terms of training rounds on non-IID data. Empirically, we provide some generalized conclusions on FL and SL in cross-device settings, including (i) the best and threshold learning rates of SL are smaller than those of FL; (ii) the performance of SL is worse than that of FL when the number of local update steps is large on highly non-IID data; (iii) the performance of SL can be better than that of FL when the number of local update steps is small on highly non-IID data.

2. PRELIMINARIES AND ALGORITHM OF SPLIT LEARNING

As two of the most popular distributed learning frameworks, both FL and SL aim to train the global model from distributed datasets. The optimization problem of FL and SL with $N$ clients can be given by

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^{N} p_i f_i(x), \tag{1}$$

where $p_i = n_i/n$ is the ratio of local samples at client $i$ ($n$ and $n_i$ are the sizes of the overall dataset $D$ and the local dataset $D_i$ at client $i$, respectively), $x$ is the model parameters, $f(x)$ is the global objective, and $f_i(x)$ denotes the local objective function on client $i$. In particular, $f_i(x) := \mathbb{E}_{\xi_i \sim D_i}[f_i(x; \xi_i)] = \frac{1}{n_i}\sum_{\xi_i \in D_i} f_i(x; \xi_i)$, where $\xi_i$ represents a data sample from the local dataset $D_i$.

SL with the global learning rate. The relay-based (sequential) training process across clients makes SL significantly different from FL; it is described in Algorithm 1. Considering the massive number of clients in the cross-device setting, only a subset $S$ of clients is selected for model training at each round. The update order of the selected clients can be meticulously designed or randomly determined (the latter is used in this paper). The $i$-th client requests and initializes with the latest model (step 4) and then performs multiple local updates (steps 5-11). After $K$ local updates, the client sends the model parameters to the next client (i.e., the $(i+1)$-th client). The local update process can be stated as:

$$\text{Initialize:}\quad x_i^{(r,0)} = \begin{cases} x^r, & i = 1 \\ x_{i-1}^{(r,K)}, & i > 1 \end{cases} \tag{2}$$

$$\text{Update:}\quad x_i^{(r,k+1)} = x_i^{(r,k)} - \eta_l\, g_i(x_i^{(r,k)}), \tag{3}$$

where $x_i^{(r,k)}$ denotes the complete local model (parameters) after the $k$-th local update on the local dataset $D_i$ in the $r$-th round, $\eta_l$ is the local learning rate, and $g_i(x_i^{(r,k)}) := \nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ represents the stochastic gradient over the mini-batch $\xi_i^{(r,k)}$ sampled randomly from $D_i$. Note that the complete model $x_i^{(r,k)}$ consists of the client-side model $x_{c,i}^{(r,k)}$ and the server-side model $x_{s,i}^{(r,k)}$, as shown in Algorithm 1.
The clients and server update their models synchronously and, without loss of generality, we do not highlight the model locations in the following. After the last client in $S$ (i.e., the $|S|$-th client) completes its local updates, it requests the initial client-side model of the current round. The global update is conducted at both the client and the server:

$$x^{r+1} = x^r + \eta_g\,(x_{|S|}^{(r,K)} - x^r), \tag{4}$$

where $x^r$ denotes the complete global model (parameters) in the $r$-th round ($x^r$ equals $x_i^{(r,0)}$ only when $i = 1$, which is different from FL) and $\eta_g$ is the global learning rate. Note that in Eq. (4) we propose adding the global learning rate $\eta_g$ to the vanilla SL algorithm (Gupta & Raskar, 2018; Vepakomma et al., 2018; Thapa et al., 2020; 2021). The global learning rate mechanism was originally developed in the FL setting (Karimireddy et al., 2020; Reddi et al., 2020; Wang et al., 2020) to reduce the client drift, and can be readily adopted in SL for the same purpose. Algorithm 1 can operate in both the centralized and peer-to-peer modes (see Appendix A) and reduces to vanilla SL if steps 13-14 are removed. For brevity, we have omitted some details of SL that are not our concern in Algorithm 1, e.g., the security and privacy settings. More details about SL can be found in Appendix A or in Gupta & Raskar (2018) and Thapa et al. (2021).

Algorithm 1 Split Learning with the Global Learning Rate

Some notations:
- $x_c$ denotes the client-side model (parameters)
- $x_s$ denotes the server-side model (parameters)
- $x := [x_c, x_s]$ denotes the complete model (parameters)

1: for round $r = 0, \ldots, R-1$ do
2:   Sample a subset $S$ of clients and determine their update order
3:   for client index $i = 1, \ldots, |S|$ do
4:     Client $i$: Request the latest client-side model and initialize $x_{c,i}^{(r,0)} \leftarrow x_c^r$ if $i = 1$, and $x_{c,i}^{(r,0)} \leftarrow x_{c,i-1}^{(r,K)}$ if $i > 1$
5:     for local update step $k = 0, \ldots, K-1$ do
6:       Client $i$: Forward propagation and send activations to the server
7:       Server: Forward propagation with activations from the client
8:       Server: Back-propagation and send gradients to the client
9:       Client $i$: Back-propagation with gradients from the server
10:      Local model updates:
           Client $i$: $x_{c,i}^{(r,k+1)} \leftarrow x_{c,i}^{(r,k)} - \eta_l\, g_i(x_{c,i}^{(r,k)})$
           Server: $x_{s,i}^{(r,k+1)} \leftarrow x_{s,i}^{(r,k)} - \eta_l\, g_i(x_{s,i}^{(r,k)})$
11:    end for
12:  end for
     Global model updates:
13:  Client $|S|$: $x_c^{r+1} \leftarrow x_c^r + \eta_g\,(x_{c,|S|}^{(r,K)} - x_c^r)$
14:  Server: $x_s^{r+1} \leftarrow x_s^r + \eta_g\,(x_{s,|S|}^{(r,K)} - x_s^r)$
15:  Client $|S|$ and Server: Store the global model
16: end for
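As a minimal sketch of the relay-based update in Eqs. (2)-(4), the following Python code runs SL with a global learning rate on toy one-dimensional objectives $f_i(x) = \frac{1}{2}(x - c_i)^2$. The toy objectives, the Gaussian gradient noise, and the function names `split_learning_round` and `run_split_learning` are illustrative assumptions and not part of the paper; the client/server split is not modeled, so `x` stands for the complete model $[x_c, x_s]$.

```python
import random


def split_learning_round(x_global, centers, K, eta_l, eta_g, noise_std=0.0, rng=None):
    """One round of SL with a global learning rate on f_i(x) = 0.5 * (x - c_i)^2."""
    if rng is None:
        rng = random.Random(0)
    x = x_global  # the first client initializes from the global model x^r
    for c in centers:  # clients are visited in sequence (relay-based training)
        for _ in range(K):  # K local update steps on client i
            grad = (x - c) + rng.gauss(0.0, noise_std)  # toy stochastic gradient g_i
            x = x - eta_l * grad
        # the next client initializes from this client's final local model
    # global update: x^{r+1} = x^r + eta_g * (x_{|S|}^{(r,K)} - x^r)
    return x_global + eta_g * (x - x_global)


def run_split_learning(x0, centers, R, K, eta_l, eta_g, noise_std=0.0, seed=0):
    """Run R rounds, drawing a random client order in each round."""
    rng = random.Random(seed)
    x = x0
    for _ in range(R):
        order = list(centers)
        rng.shuffle(order)
        x = split_learning_round(x, order, K, eta_l, eta_g, noise_std, rng)
    return x
```

With `eta_g = 1.0` the global update simply adopts the last client's model, which corresponds to vanilla SL in this toy setting.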

3. RELATED WORK

Variants of SL. SL is deemed a promising paradigm for distributed model training at resource-constrained devices, given its computational efficiency on the client side. Most existing works focus on reducing the training delay arising from the relay-based training manner in the multi-user scenario. SplitFed (Thapa et al., 2020) is one popular model-parallel algorithm that combines the strengths of FL and SL, where each client has one corresponding instance of the server-side model in the main server to form a pair. Each pair constitutes a complete model and conducts the local updates in parallel. After each training round, the fed server collects and aggregates the client-side local updates. The aggregated client-side model is disseminated to all the clients before the next round. The main server performs the same operations on the instances of the server-side model. SplitFedv2 (Thapa et al., 2020), SplitFedv3 (Gawali et al., 2021), SFLG (Gao et al., 2021) and FedSeq (Zaccone et al., 2022) are variants of SplitFed. In particular, SFLG (Gao et al., 2021) is one generalized variant of SplitFed, combining SplitFed and SplitFedv2.

Convergence analysis of SL. In the case of IID data, the convergence analysis of SL is identical to that of standard Minibatch SGD (Wang et al., 2022; Park et al., 2021), so some convergence properties of Minibatch SGD apply to SL too. The algorithm in Han et al. (2021) reduces the latency and downlink communication of SplitFed by adding auxiliary networks at the client side for quick model updates. Their convergence analysis combines the analysis of Belilovsky et al. (2020) and FedAvg. Wang et al. (2022) proposed FedLite to reduce the uplink communication overhead by compressing activations with product quantization and provided the convergence analysis of FedLite. However, their convergence result recovers that of Minibatch SGD when there is no quantization. SGD with biased gradients (Ajalloeian & Stich, 2020) is also related. However, it only converges to a neighborhood of the solution. Furthermore, we find that Woodworth et al. (2020a; b) compared the convergence of distributed Minibatch SGD and local SGD under homogeneous and heterogeneous settings, respectively. To differentiate our work, we show how these algorithms operate in Appendix B. In summary, a convergence analysis of SL on non-IID data is still lacking.
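To make the distinction between Minibatch SGD, local SGD (FL) and SL concrete, here is a hedged sketch of one round of each update rule as summarized in the appendix of the paper. For simplicity it uses exact (deterministic) gradients supplied as Python callables; the function names and the one-dimensional setting are assumptions made for illustration only.

```python
def minibatch_sgd_round(x, grads, K, eta):
    """Every client evaluates K gradients at the same point x^r; no local updates."""
    total = sum(g(x) for g in grads for _ in range(K))
    return x - eta * total / (len(grads) * K)


def local_sgd_round(x, grads, K, eta):
    """Local SGD / FL: clients run K local steps in parallel from x^r, then average."""
    finals = []
    for g in grads:
        xi = x
        for _ in range(K):
            xi = xi - eta * g(xi)
        finals.append(xi)
    return sum(finals) / len(finals)


def sl_round(x, grads, K, eta):
    """SL: clients run K local steps in sequence, each starting from its predecessor."""
    xi = x
    for g in grads:
        for _ in range(K):
            xi = xi - eta * g(xi)
    return xi
```

For two clients with gradients `x - 1` and `x + 1`, the first two rules leave `x = 0` unchanged, while the sequential rule moves away from it; this is a toy view of the client drift discussed in the introduction.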

4.1. ASSUMPTIONS

We make the following standard assumptions on the local objective functions $\{f_i(x)\}_{i=1}^{N}$.

Assumption 1 (L-smooth). Each local objective function $f_i$, $i \in \{1, 2, \ldots, N\}$, is $L$-smooth, i.e., there exists a constant $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$ for all $x$ and $y$.

Assumption 2 (Unbiased gradient and bounded variance). For each client $i \in \{1, 2, \ldots, N\}$, the stochastic gradient $g_i(x) := \nabla f_i(x; \xi_i)$ is unbiased, $\mathbb{E}[g_i(x)] = \nabla f_i(x)$, and has bounded variance, $\mathbb{E}_{\xi_i}[\|g_i(x) - \nabla f_i(x)\|^2] \le \sigma^2$.

Assumption 3 (Bounded dissimilarity). There exist constants $B \ge 1$ and $G \ge 0$ such that $\frac{1}{N}\sum_{i=1}^{N} \|\nabla f_i(x)\|^2 \le B^2\,\|\nabla f(x)\|^2 + G^2$. In the IID case, $B = 1$ and $G = 0$, since all the local objective functions are identical to each other. In the non-IID case, $B$ and $G$ measure the heterogeneity of the data distribution.
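As an illustrative example of Assumption 3 (not taken from the paper), consider toy objectives $f_i(x) = \frac{1}{2}(x - c_i)^2$ with the unweighted average $f(x) = \frac{1}{N}\sum_i f_i(x)$. Then $\frac{1}{N}\sum_i (x - c_i)^2 = (x - \bar{c})^2 + \frac{1}{N}\sum_i (c_i - \bar{c})^2$, so the assumption holds with equality for $B = 1$ and $G^2$ equal to the variance of the local optima $c_i$. The sketch below checks this identity numerically; the function name is hypothetical.

```python
def dissimilarity_sides(x, centers):
    """Return both sides of Assumption 3 for f_i(x) = 0.5 * (x - c_i)^2 with B = 1."""
    n = len(centers)
    c_bar = sum(centers) / n
    lhs = sum((x - c) ** 2 for c in centers) / n       # (1/N) * sum_i |grad f_i(x)|^2
    g_sq = sum((c - c_bar) ** 2 for c in centers) / n  # G^2: spread of the local optima
    rhs = 1.0 * (x - c_bar) ** 2 + g_sq                # B^2 * |grad f(x)|^2 + G^2
    return lhs, rhs
```

When all `centers` coincide (the IID case in this toy setting), `g_sq` is zero, matching $B = 1$ and $G = 0$.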

4.2. CONVERGENCE RESULT AND DISCUSSION

Without loss of generality, we assume that all the clients participate in the SL process ($|S| = N$) and adopt the unweighted global objective function $f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$. Note that the results in the unweighted case can be readily extended to the weighted case of Eq. (1). Following the proofs of Wang et al. (2020), Karimireddy et al. (2020) and Khaled et al. (2020), we give the convergence results of SL as follows (details are in Appendix D).

Theorem 1. Let Assumptions 1, 2 and 3 hold. Suppose that the local learning rate satisfies $\eta_l \le \frac{1}{2NKL}\min\left\{\frac{1}{\sqrt{2B^2+1}}, \frac{1}{\eta_g}\right\}$. For Algorithm 1, it holds that

$$\mathbb{E}\left[\|\nabla f(\bar{x}^R)\|^2\right] \le \underbrace{\frac{4[f(x^0) - f(x^*)]}{NK\eta_g\eta_l R}}_{T_1:\ \text{initialization error}} + \underbrace{12N^2K^2\eta_l^2L^2G^2 + 6N^2K\eta_l^2L^2\sigma^2}_{T_2:\ \text{client drift error}} + \underbrace{4N\eta_g\eta_l L\sigma^2}_{T_3:\ \text{global variance}}, \tag{5}$$

where $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$ is the averaged global model over the $R$ rounds.

Corollary 1. Choose $\eta_g\eta_l = \frac{1}{L\sqrt{T}}$ and apply the result of Theorem 1. For sufficiently large $T$ ($T \ge 4N^2K^2\max\left\{\frac{2B^2+1}{\eta_g^2}, 1\right\}$), it holds that

$$\mathbb{E}\left[\|\nabla f(\bar{x}^R)\|^2\right] \le \underbrace{O\!\left(\frac{L[f(x^0) - f(x^*)]}{\sqrt{T}}\right)}_{T_1:\ \text{initialization error}} + \underbrace{O\!\left(\frac{N^2K^2G^2 + N^2K\sigma^2}{\eta_g^2 T}\right)}_{T_2:\ \text{client drift error}} + \underbrace{O\!\left(\frac{\sigma^2}{\sqrt{T}}\right)}_{T_3:\ \text{global variance}}, \tag{6}$$

where $T$ is the total number of iterations (i.e., $NKR$ in SL).

The upper bound of $\mathbb{E}[\|\nabla f(\bar{x}^R)\|^2]$ consists of three types of terms: (i) the initialization error, (ii) the client drift error, caused by the client drift (see Lemma 4 in Appendix D.2), and (iii) the global variance. The result demonstrates the relationships between convergence and factors such as the number of clients, the data heterogeneity, the global/local learning rates and the number of local update steps. According to the result, a large $\eta_l$ makes the initialization error decrease faster but causes larger client drift error and global variance. Next, we discuss the convergence rate and the influencing factors of SL in detail.

Convergence rate.
By Corollary 1, for sufficiently large $T$, the convergence rate is determined by the initialization error and the global variance (see Eq. (6)), resulting in a convergence rate of $O(1/\sqrt{T})$. We can recover the convergence of SGD (Bottou et al., 2018; Wang et al., 2020) when $N = 1$ and $K = 1$ (i.e., without the client drift error). This holds but is not shown in Theorem 1 directly, since writing out the constants in Eq. (5) in detail is complicated. We defer the discussion to Appendix D.5.

Effect of data heterogeneity. According to Theorem 1, on highly non-IID data ($B$ and $G$ are large), a small $\eta_l$ is required for the convergence of SL (see the condition of Theorem 1). In addition, the client drift error also increases. As a result, large data heterogeneity harms the convergence of SL, which is consistent with previous studies (Gao et al., 2020; 2021).

Effect of $K$. $K$ is the number of local update steps. An immediate question is whether we can improve the convergence by adding local update steps when $R$ is fixed. The answer is yes. By Eq. (5), as $K$ increases, the initialization error decreases and the client drift error increases, which implies that an optimal $K$ exists. Based on Eq. (5), we can further see that larger data heterogeneity makes the optimal $K$ smaller. This property is analogous to FL (McMahan et al., 2017).

Effect of $\eta_g$. The global learning rate $\eta_g$ can be used to reduce the client drift without hurting the progress of SL. A large $\eta_g$ reduces the client drift error, hence improving the convergence rate. For example, by Corollary 1, SL shows an $O\!\left(\frac{LF + \sigma^2}{\sqrt{T}} + \frac{N^2K^2G^2 + N^2K\sigma^2}{T}\right)$ convergence rate if $\eta_g = 1$ (where $F := f(x^0) - f(x^*)$), while it shows a convergence rate of $O\!\left(\frac{LF + \sigma^2}{\sqrt{T}} + \frac{K^2G^2 + K\sigma^2}{T}\right)$ if $\eta_g = N$. Detailed analyses of these factors are deferred to Section 5.2, where they are combined with the experiments.
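The trade-off in $K$ can be explored numerically by evaluating the three terms of the bound in Theorem 1. The sketch below is only an illustration: the function names and all constants in the usage example ($F$, $L$, $G$, $\sigma$, $B$, $N$, $R$, $\eta_l$) are hypothetical values chosen for demonstration, not values from the paper.

```python
import math


def theorem1_terms(F, L, G, sigma, N, K, R, eta_l, eta_g=1.0):
    """Return (T1, T2, T3) of the Theorem 1 bound, with F = f(x^0) - f(x^*)."""
    t1 = 4 * F / (N * K * eta_g * eta_l * R)                    # initialization error
    t2 = (12 * N**2 * K**2 * eta_l**2 * L**2 * G**2
          + 6 * N**2 * K * eta_l**2 * L**2 * sigma**2)          # client drift error
    t3 = 4 * N * eta_g * eta_l * L * sigma**2                   # global variance
    return t1, t2, t3


def max_local_lr(L, B, N, K, eta_g=1.0):
    """Largest eta_l allowed by the condition of Theorem 1."""
    return min(1 / math.sqrt(2 * B**2 + 1), 1 / eta_g) / (2 * N * K * L)


def best_k(F, L, G, sigma, B, N, R, eta_l, eta_g=1.0, k_max=100):
    """K in [1, k_max] that minimizes the bound among K satisfying the condition."""
    feasible = [k for k in range(1, k_max + 1)
                if eta_l <= max_local_lr(L, B, N, k, eta_g)]
    return min(feasible, key=lambda k: sum(theorem1_terms(F, L, G, sigma, N, k, R, eta_l, eta_g)))
```

For instance, `best_k(F=1, L=1, G=1, sigma=1, B=1, N=10, R=100, eta_l=1e-3)` returns an interior value of $K$ rather than the smallest or largest feasible one, in line with the observation that an optimal $K$ exists.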

4.3. COMPARISON BETWEEN FL AND SL

In this section, we first give the convergence result of FL under our setting, and then compare the results of FL and SL to answer the second question, "In which setting will the SL performance outperform FL?", theoretically. We summarize the convergence results of FL (Wang et al. (2020), reproduced as Theorem 2 in Appendix D.3) and SL (Theorem 1) in Table 1. The convergence guarantee of FL is one of the best known convergence results for non-convex functions in FL, which makes our comparison persuasive. Choosing $\eta_l = \frac{\sqrt{N}}{L\sqrt{T}}$ in Theorem 2 gives the $O(1/\sqrt{NT})$ convergence rate, i.e., the linear speedup in FL.

Table 1: Comparison of convergence results between FL (Wang et al., 2020) and SL (Theorem 1) for non-convex functions. The convergence guarantee of FL is given in the 1st (upper bound) and 2nd (constraint on $\eta_l$) rows. The effective learning rate versions are shown in the 3rd (FL) and 5th (SL) rows. The number of rounds required to reach $\epsilon$ accuracy is given in the 4th (FL) and 6th (SL) rows. $F := f(x^0) - f(x^*)$. Constants (including $L$) are omitted. $\eta_g$ is set to $\eta_g = 1$.

| Row | Method | Quantity | Result |
|---|---|---|---|
| 1 | FL (Wang et al., 2020) | Upper bound | $O\!\left(\frac{F}{\eta_l K R}\right) + O\!\left(\eta_l^2K^2G^2 + \eta_l^2K\sigma^2\right) + O\!\left(\frac{\eta_l\sigma^2}{N}\right)$ |
| 2 | FL | Constraint | $\eta_l \le \frac{1}{2KL}\min\left\{\frac{1}{\sqrt{2B^2+1}}, \frac{1}{\eta_g}\right\}$ |
| 3 | FL | Bound in $\eta_{\mathrm{FL}}$ | $O\!\left(\frac{F}{\eta_{\mathrm{FL}} R}\right) + O\!\left(\eta_{\mathrm{FL}}^2G^2 + \frac{\eta_{\mathrm{FL}}^2\sigma^2}{K}\right) + O\!\left(\frac{\eta_{\mathrm{FL}}\sigma^2}{NK}\right)$ |
| 4 | FL | $R_\epsilon$ | $O\!\left(\frac{F^2}{\epsilon^2} + \frac{\sigma^4}{N^2K^2\epsilon^2} + \frac{KG^2 + \sigma^2}{K\epsilon}\right)$ |
| 5 | SL | Bound in $\eta_{\mathrm{SL}}$ | $O\!\left(\frac{F}{\eta_{\mathrm{SL}} R}\right) + O\!\left(\eta_{\mathrm{SL}}^2G^2 + \frac{\eta_{\mathrm{SL}}^2\sigma^2}{K}\right) + O\!\left(\frac{\eta_{\mathrm{SL}}\sigma^2}{K}\right)$ |
| 6 | SL | $R_\epsilon$ | $O\!\left(\frac{F^2}{\epsilon^2} + \frac{\sigma^4}{K^2\epsilon^2} + \frac{KG^2 + \sigma^2}{K\epsilon}\right)$ |

The constraint on the local learning rate of SL is tougher than FL. The local learning rate of SL (see Theorem 1) has a tougher constraint than that of FL (see the 2nd row of Table 1), which indicates that SL is more sensitive to the heterogeneity of data. This is significant for the selection of the learning rate of SL in practice (see the comparison experiments).

Effective learning rate. We next focus on comparing FL and SL in terms of rounds.
Note that this comparison (running for the same $R$) is fair given the same total computation cost (including the computation cost on the client side and the server side). Beginning with the observation that the convergence guarantees and constraints look very alike when choosing $\eta_{l(\mathrm{SL})} = \eta_{l(\mathrm{FL})}/N$ (where $\eta_{l(\mathrm{FL})}$ and $\eta_{l(\mathrm{SL})}$ denote the local learning rates of FL and SL, respectively), we define the effective learning rate $\eta_{\mathrm{FL}} := K\eta_g\eta_l$ for FL and $\eta_{\mathrm{SL}} := NK\eta_g\eta_l$ for SL, as Karimireddy et al. (2020) and Wang et al. (2020) did. Note that the definitions of the effective learning rate differ between FL and SL. As a result, we obtain the effective learning rate versions of the convergence guarantees shown in the 3rd and 5th rows of Table 1.

The guarantee of SL is worse than FL in terms of training rounds on non-IID data. To make a comparison, we need to choose an appropriate $\eta_l$ for both. Considering that $\eta_{\mathrm{FL}}$ and $\eta_{\mathrm{SL}}$ have the same constraints, we can choose $\eta_{\mathrm{FL}} = \eta_{\mathrm{SL}} = 1/\sqrt{R}$ for both bounds (see the 3rd and 5th rows of Table 1). Then, after $R$ rounds, we get (i) $O\!\left(\frac{F}{\sqrt{R}} + \frac{KG^2 + \sigma^2}{KR} + \frac{\sigma^2}{NK\sqrt{R}}\right)$ for FL and (ii) $O\!\left(\frac{F}{\sqrt{R}} + \frac{KG^2 + \sigma^2}{KR} + \frac{\sigma^2}{K\sqrt{R}}\right)$ for SL, which yield the round complexity $R_\epsilon$ shown in the 4th and 6th rows of Table 1. Then, for sufficiently large $R$, $R_\epsilon$ is determined by the first and the second terms (see the 4th and 6th rows). In particular, the only difference in the complexity appears in the second term (marked in red), which indicates that the guarantee of SL is worse than that of FL in terms of rounds. However, we note that the gap is not obvious when $K$ or $\sigma$ is small. Further, considering the advantage of SL in the constants (this holds but is not shown in Theorem 1; see Appendix D.2), the performance comparison between FL and SL is still uncertain. To make a more detailed comparison, we have conducted extensive experiments and give some generalized conclusions in Section 5.3.
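The round complexities in the 4th and 6th rows of Table 1 can be compared directly. The following sketch evaluates both expressions with all omitted constants set to one, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def rounds_fl(eps, F, G, sigma, K, N):
    """R_eps of FL (4th row of Table 1), with omitted constants set to 1."""
    return (F**2 / eps**2
            + sigma**4 / (N**2 * K**2 * eps**2)
            + (K * G**2 + sigma**2) / (K * eps))


def rounds_sl(eps, F, G, sigma, K):
    """R_eps of SL (6th row of Table 1), with omitted constants set to 1."""
    return (F**2 / eps**2
            + sigma**4 / (K**2 * eps**2)
            + (K * G**2 + sigma**2) / (K * eps))


def round_gap(eps, F, G, sigma, K, N):
    """Extra rounds in the SL expression; only the second term differs."""
    return rounds_sl(eps, F, G, sigma, K) - rounds_fl(eps, F, G, sigma, K, N)
```

In this form the gap equals $\frac{\sigma^4}{K^2\epsilon^2}\left(1 - \frac{1}{N^2}\right)$, so it vanishes for $N = 1$ or $\sigma = 0$ and grows with the gradient noise $\sigma$.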

5. EXPERIMENTS

Our experimental environment is ideal (without communication and computation restrictions), but it is sufficient to examine our convergence theory. The convergence rate is evaluated in terms of training rounds. We describe the detailed experimental setup in Section 5.1, evaluate the effect of the factors on the performance of SL in Section 5.2, and compare FL and SL in cross-device settings in Section 5.3. More experimental details are in Appendix E.

5.1. EXPERIMENTAL SETUP

Datasets and models. We adopt the following setups: (i) training LeNet-5 (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 1998); (ii) training LeNet-5 on the Fashion-MNIST dataset (Xiao et al., 2017); (iii) training VGG-11 (Simonyan & Zisserman, 2014) on the CIFAR-10 dataset (Krizhevsky et al., 2009). For SL, LeNet-5 is split after the second 2D MaxPool layer, with 6% of the entire model size retained at the client; VGG-11 is split after the third 2D MaxPool layer, with 10% of the entire model size at the client. Ideally, the split layer position has no effect on the performance of SL (Wang et al., 2022).

Effect of data heterogeneity. As shown in Figure 2a, the training loss curve for the IID distribution is the lowest and most stable. When the data heterogeneity increases ($C$ decreases from 8 to 2), SL shows worse performance. This phenomenon is in accordance with our analysis in Section 4.2 that large data heterogeneity harms the convergence of SL.

Effect of $K$. Figure 2b shows the training loss in terms of communication rounds $R$. SL shows the best performance when $E = 2$. This verifies that an optimal $K$ exists and that a suitable $K$ can improve the convergence. An overly large $K$ can even harm the convergence rate (see the curves for $E = 8$ and $E = 10$).

Effect of $\eta_g$. The effect of $\eta_g$ in FL has been empirically studied in Reddi et al. (2020), who tune $\eta_l$ and $\eta_g$ by grid search. We follow a similar method to study $\eta_g$ in SL by choosing different combinations of $\eta_g$ and $\eta_l$. As shown in Figure 2c, the dark green grids with high test accuracy concentrate in the bottom-left triangular region, which shows that the product $\eta_g\eta_l$ should not be too large. Also note that $\eta_g$ cannot be set arbitrarily large and has a limited range, which is also observed in FL (Reddi et al., 2020). Thus, tuning $\eta_g$ within this limited range is suggested. Besides, we find that $\eta_g$ can be introduced to other SL frameworks, such as SplitFed. However, further research on $\eta_g$ is needed to address open issues, such as how to tune $\eta_g$ to gain improvement in FL and SL.
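The grid search over $(\eta_g, \eta_l)$ can be mimicked on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it runs SL with exact gradients on $f_i(x) = \frac{1}{2}(x - c_i)^2$ and reports the final distance to the global optimum for each pair of learning rates, returning infinity when the iterates blow up. All names and default values are hypothetical.

```python
def final_distance(eta_g, eta_l, centers=(-1.0, 1.0), K=2, R=200, x0=5.0):
    """Distance |x^R - x*| of toy SL after R rounds; inf if the iterates diverge."""
    x_star = sum(centers) / len(centers)  # minimizer of the unweighted global objective
    x = x0
    for _ in range(R):
        xi = x
        for c in centers:          # sequential pass over the clients
            for _ in range(K):     # K exact-gradient local steps
                xi -= eta_l * (xi - c)
        x = x + eta_g * (xi - x)   # global update with the global learning rate
        if abs(x) > 1e6:
            return float("inf")
    return abs(x - x_star)


def grid_search(eta_gs, eta_ls):
    """Evaluate every (eta_g, eta_l) combination on the toy problem."""
    return {(eg, el): final_distance(eg, el) for eg in eta_gs for el in eta_ls}
```

In this toy setting, a large $\eta_g$ remains stable when $\eta_l$ is small (e.g., `final_distance(8.0, 0.01)`), whereas the same $\eta_g$ with a large $\eta_l$ diverges (e.g., `final_distance(8.0, 0.5)`), which mirrors the observation that the combination of the two learning rates should not be too large.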

5.3. EMPIRICAL COMPARISON BETWEEN FL AND SL

We have compared FL and SL theoretically in Section 4.3. A natural question is how the learning performance of SL compares to that of FL in practice. In previous work, the same learning rates are used to evaluate the performance of SL and FL (Gao et al., 2020; 2021). However, the theoretical analyses in Section 4.2 show that the appropriate learning rate for SL may deviate from that of FL. Thus, we evaluate the performance of SL and FL with different learning rates for a fair comparison. The learning rates are selected from {0.0005, 0.001, 0.005, 0.01, 0.05, 0.1}. To evaluate the effect of $K$ on the performance comparison of FL and SL, we adopt two settings: $E = 1$ and $E = 10$. When $E = 1$, we run 1000 rounds of training on MNIST and Fashion-MNIST and 4000 rounds on CIFAR-10; when $E = 10$, we run 100 rounds of training on MNIST and Fashion-MNIST and 400 rounds on CIFAR-10.

The best and threshold learning rate of SL is smaller than FL. We refer to the learning rate that yields the best test accuracy as the best learning rate, and to the minimal learning rate that makes the training fail as the threshold learning rate. According to our theory, the tougher constraint indicates a smaller threshold learning rate for SL, and the mathematical form of Eq. (5) indicates a smaller best learning rate. To verify this point, we use the "best" learning rate, i.e., the one that yields the "best" test accuracy among the learning rates we choose, as a substitute for the actual best learning rate. The "threshold" learning rate is defined similarly. As shown in Table 2, the "best" and "threshold" learning rates of SL are smaller than those of FL, especially on highly non-IID data (e.g., $\alpha = 0.2$). Furthermore, we note that the "best" and "threshold" learning rates become smaller as the heterogeneity of data becomes larger, which is also in accordance with our theory.

Performance comparison on IID data. SL can obtain a faster convergence rate with comparable or even higher test accuracy than FL (see the left plot in Figure 3 and Table 2).
This is consistent with the conclusion in Gao et al. (2020; 2021).

Performance comparison on non-IID data. Our theory proves that the guarantee of SL is worse than that of FL in terms of rounds when $K$ is large. As shown in Table 2, almost all the "best" test accuracies of FL on very highly non-IID data (i.e., $\alpha = 0.2$ and $C = 2$) beat those of SL when $E = 10$. When $E$ ($K$) is large, FL also converges faster than SL (see the middle plot in Figure 3). However, we find that SL performs better than FL when $E$ ($K$) is small (see the $E = 1$ column in Table 2 and the right plot in Figure 3). In some cases ($\alpha = 0.5$ and $C = 5$ on CIFAR-10), SL is better even when $E = 10$.

Figure 3: Top-1 test accuracy of FL and SL on non-IID data in terms of rounds. We illustrate some results of CIFAR-10 from Table 2.

Table 2: Performance comparison between FL and SL in cross-device settings. The "best" test accuracy (%) is in the "Best" accuracy (lr) column (the "best" learning rate is in parentheses) and the "threshold" learning rate is in the "Threshold" lr column. Note that ># means that the "threshold" learning rate is larger than #. The higher "best" accuracy between FL and SL is marked in bold (excluding the results whose difference is within 1%).

[Table 2 columns: Dataset, Distribution, E = 1, E = …]

In SL, the complete (or full) ML model is split into two portions. The portion of the complete model maintained by the clients is called the client-side model. The portion maintained by the server is called the server-side model. In SL, the client-side model is owned by all the clients, while the server-side model is owned only by the server. Any client can cooperate with the server to complete the model training task but cannot do it independently (without the help of the server). We next introduce some basic concepts used in the paper.

Local update (step). The process in which one client and the server cooperate to conduct the model updates, i.e., the inner loop of Algorithm 1. Note that in SL, though the model update requires communication between the client and server, the model is still trained on the local dataset. So the process is called a local update (Thapa et al., 2020).

Local model. The concatenation of the client-side and server-side models after each local update.

Training round. The process in which all the clients (or a selected subset of clients) complete their local updates, i.e., the outer loop of Algorithm 1. It is also called a global epoch or global round.

Global model. The concatenation of the client-side and server-side models after each global round.

Table 3 summarizes the notations. In particular, the superscripts and subscripts of random variables appearing in the paper have the same form, $a_i^{(r,k)}$, where $r$ is the index of rounds, $k$ is the index of local update steps, and $i$ is the index of clients. It denotes the random variable $a$ after the $k$-th local update on the local dataset $D_i$ in the $r$-th round. The random variable $a$ can be $x$ (model parameters), $\xi$ (random samples) or $g$ (stochastic gradients).

| Symbol | Description |
|---|---|
| $x^r$ | The complete global model (parameters) in the $r$-th round |
| $x_i^{(r,k)}$ | The complete local model (parameters) after the $k$-th local update on the local dataset $D_i$ in the $r$-th round |
| $x_c$, $x_s$ | The client-side model, server-side model |
| $g_i(x_i^{(r,k)})$ | Stochastic gradient over the mini-batch $\xi_i^{(r,k)}$ sampled randomly from $D_i$, also denoted as $\nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ |

Two modes of SL. In fact, there are two approaches to training in SL: (i) with client-side model synchronization and (ii) without client-side model synchronization (Singh et al., 2019; Duan et al., 2022). In this paper, SL refers to the first approach, i.e., SL with client-side model synchronization. As shown in Figure 4, there are two modes of SL with client-side model synchronization: (i) the peer-to-peer mode and (ii) the centralized mode. The only difference between the peer-to-peer and centralized modes is the way the model parameters are synchronized.

• Peer-to-peer mode. Clients store the model parameters by themselves after the local updates. So clients get the latest model from other clients.

• Centralized mode. Clients send the model parameters to the server after the local updates. So clients get the latest model from the server. The centralized mode requires more communication than the peer-to-peer mode.

Implementation of SL with $\eta_g$ in the peer-to-peer and centralized modes.
In each round, the last client (client $N$ in Figure 4) needs to request the initial client-side model and conduct the global model update after completing its local updates. The server conducts the global model update synchronously. For the peer-to-peer mode, this process causes (i) additional communication cost (the red arrow on the left side of Figure 4) and (ii) additional storage cost at the devices and the server (storing the global model). For the centralized mode, this process causes (i) additional communication cost (the red arrow on the right side of Figure 4) and (ii) additional storage cost at the server (the client-side global model is also stored at the server). The communication cost of SL with the global update is given in Table 4, based on Singh et al. (2019).

ing). This may cause the "catastrophic forgetting" issue (Gawali et al., 2021; Duan et al., 2022). To mitigate the issue, SplitFedv3 proposes alternate mini-batch training, where a client updates its client-side model on one mini-batch, after which the next client in order takes over (Gawali et al., 2021). SFLG (Gao et al., 2021) is one generalized variant of SplitFed. The clients are allocated to multiple groups, with one server in each group. The training inside each group is identical to SplitFedv2. Then the server-side "global" models of the groups are aggregated (e.g., by weighted averaging) to obtain the server-side global model of all groups. FedSeq (Zaccone et al., 2022) is the same as SFLG except that the training inside each group is identical to vanilla SL. More variants can be found in Thapa et al. (2021) and Duan et al. (2022).

Convergence analysis of SL. The convergence analysis of SL is identical to that of standard Minibatch SGD on IID data (Wang et al., 2022; Park et al., 2021), so some convergence properties of Minibatch SGD apply to SL too. The algorithm in Han et al.
(2021) reduces the latency and downlink communication on SplitFed by adding auxiliary networks at client-side to generate the local loss for model updating. Their convergence analysis combines the analysis of Belilovsky et al. (2020) and FedAvg. Wang et al. (2022) proposed FedLite to reduce the uplink communication overhead by compressing activations with product quantization and provided the convergence analysis of FedLite. However, their convergence recovers that of Minibatch SGD when there is no quantization. SGD with biased gradients (Ajalloeian & Stich, 2020 ) is also related. However, it only converges to a neighborhood of the solution. Furthermore, we find that Woodworth et al. (2020a; b) compared the convergence of distributed Minibatch SGD and local SGD under homogeneous and heterogeneous settings. To differentiate our work, we show how these algorithms operate in Table 5 . For Minibatch SGD, there is no local update and each client computes K stochastic gradients at the same point x r . FL and SL make K times more updates than Minibatch SGD. However, models in FL are training in parallel, while models in SL are in sequence. ∼ D i denotes one random sample from client i for the k + 1-th local update in the r-th round.

| Algorithm | Global update | Local update of client $i$ |
| --- | --- | --- |
| Minibatch SGD | $x^{r+1} = x^r - \eta_l \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=0}^{K-1} \nabla f_i(x^r; \xi_i^{(r,k)})$ | - |
| Local SGD (or FL) | $x^{r+1} = x^r - \eta_l \frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ | Initialize: $x_i^{(r,0)} = x^r$. Update: $x_i^{(r,k+1)} = x_i^{(r,k)} - \eta_l \nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ |
| SL | $x^{r+1} = x^r - \eta_l \sum_{i=1}^{N} \sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ | Initialize: $x_i^{(r,0)} = x_{i-1}^{(r,K)}$. Update: $x_i^{(r,k+1)} = x_i^{(r,k)} - \eta_l \nabla f_i(x_i^{(r,k)}; \xi_i^{(r,k)})$ |

C SUMMARY OF THEORIES

There are two methods to establish the convergence of SL: (i) bounding the progress of all clients in one round; (ii) bounding the progress of one client in one round. The second method follows directly from the techniques of FL (Khaled et al., 2020; Karimireddy et al., 2020; Wang et al., 2020). However, it only converges to a neighborhood around the stationary point of the global function. This case is similar to biased SGD (Ajalloeian & Stich, 2020). The conclusion is intuitive, since the local stochastic gradient $\nabla f_i(x; \xi_i)$ ($\xi_i \sim D_i$) generated by the local data of any client $i$ is a biased estimator of the gradient $\nabla f(x)$ of the global function. The first method, bounding the progress of all clients in one round, is our main contribution and shows that SL can converge to the stationary point of the global function. The main theories are summarized in Table 6.

Table 6: Summary of theories for SL. $\eta$ is the effective learning rate defined in Section 4.3. $\eta$ in "progress of all clients in one round" is defined as $NK\eta_g\eta_l$, while $\eta$ in "progress of one client in one round" is defined as $K\eta_l$. $\eta_g = 1$. $F := f(x^0) - f(x^*)$.

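| Outline | Theory |
| --- | --- |
| Progress of all clients in one round ($\eta = NK\eta_g\eta_l$) | $\mathbb{E}[\lVert\nabla f(x^r)\rVert^2] \le O\left(\frac{f(x^r) - f(x^{r+1})}{\eta}\right) + O\left(\eta^2 G^2 + \frac{\eta^2\sigma^2}{K}\right) + O\left(\frac{\eta\sigma^2}{K}\right)$ (Thm. 1) |
| | $\mathbb{E}[\lVert\nabla f(\bar{x}^R)\rVert^2] \le O\left(\frac{F}{\eta R}\right) + O\left(\eta^2 G^2 + \frac{\eta^2\sigma^2}{K}\right) + O\left(\frac{\eta\sigma^2}{K}\right)$ (Thm. 1) |
| | $\mathbb{E}[\lVert\nabla f(\bar{x}^R)\rVert^2] \le O\left(\frac{F}{\sqrt{T}}\right) + O\left(\frac{N^2K^2G^2 + N^2K\sigma^2}{T}\right) + O\left(\frac{\sigma^2}{\sqrt{T}}\right)$ (Cor. 1) |
| | $\mathbb{E}[\lVert\nabla f(\bar{x}^R)\rVert^2] \le O\left(\frac{F}{\sqrt{R}}\right) + O\left(\frac{KG^2 + \sigma^2}{KR}\right) + O\left(\frac{\sigma^2}{K\sqrt{R}}\right)$ (Cor. 2) |
| | $\mathbb{E}[\lVert\nabla f(\bar{x}^R)\rVert^2] \le O\left(\frac{F\sqrt{2B^2+1}}{R+1}\right) + O\left(\frac{F^{2/3}(G^2 + \sigma^2/K)^{1/3}}{(R+1)^{2/3}}\right) + O\left(\frac{F\sigma^2}{\sqrt{K(R+1)}}\right)$ (Cor. 3) |
| Progress of one client in one round ($\eta = K\eta_l$) | $\mathbb{E}[\lVert\nabla f(x_i^{(r,0)})\rVert^2] \le O\left(\frac{f(x_i^{(r,0)}) - f(x_{i+1}^{(r,0)})}{\eta}\right) + O\left(\eta^2 G^2 + \frac{\eta^2\sigma^2}{K}\right) + O\left(\frac{\eta\sigma^2}{K}\right) + O(G^2)$ (Thm. 3) |
| | $\frac{1}{NR}\sum_{r=0}^{R-1}\sum_{i=1}^{N} \mathbb{E}[\lVert\nabla f(x_i^{(r,0)})\rVert^2] \le O\left(\frac{F}{\eta NR}\right) + O\left(\eta^2 G^2 + \frac{\eta^2\sigma^2}{K}\right) + O\left(\frac{\eta\sigma^2}{K}\right) + O(G^2)$ (Thm. 3) |

D PROOF OF RESULTS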

D.1 BASIC TECHNICAL LEMMAS AND NOTATIONS

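Lemma 1. Let $x_1, \ldots, x_N$ be $N$ vectors. Then

$$\|x_i + x_j\|^2 \le 2\|x_i\|^2 + 2\|x_j\|^2, \tag{7}$$

$$\|x_i + x_j\|^2 \le (1+a)\|x_i\|^2 + \left(1 + \frac{1}{a}\right)\|x_j\|^2 \quad \text{for any } a > 0, \tag{8}$$

$$\left\|\sum_{i=1}^{N} x_i\right\|^2 \le N \sum_{i=1}^{N} \|x_i\|^2. \tag{9}$$

Lemma 2 (Jensen's inequality). For any convex function $f$ and any vectors $x_1, \ldots, x_N$, we have

$$f\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) \le \frac{1}{N}\sum_{i=1}^{N} f(x_i). \tag{10}$$

As a special case with $f(x) = \|x\|^2$, we obtain

$$\left\|\frac{1}{N}\sum_{i=1}^{N} x_i\right\|^2 \le \frac{1}{N}\sum_{i=1}^{N} \|x_i\|^2. \tag{11}$$

Lemma 3. Suppose $\{A_k\}_{k=1}^{T}$ is a sequence of random matrices and $\mathbb{E}[A_k \mid A_{k-1}, A_{k-2}, \ldots, A_1] = 0$ for all $k$. Then

$$\mathbb{E}\left[\left\|\sum_{k=1}^{T} A_k\right\|_F^2\right] = \sum_{k=1}^{T} \mathbb{E}\left[\|A_k\|_F^2\right]. \tag{12}$$

Proof. This is Lemma 2 of Wang et al. (2020).

$$\begin{aligned}
\mathbb{E}\left[\left\|\sum_{k=1}^{T} A_k\right\|_F^2\right] &= \sum_{k=1}^{T} \mathbb{E}\left[\|A_k\|_F^2\right] + \sum_{i=1}^{T}\sum_{j=1, j\neq i}^{T} \mathbb{E}\left[\mathrm{Tr}\{A_i^\top A_j\}\right] && (13) \\
&= \sum_{k=1}^{T} \mathbb{E}\left[\|A_k\|_F^2\right] + \sum_{i=1}^{T}\sum_{j=1, j\neq i}^{T} \mathrm{Tr}\{\mathbb{E}[A_i^\top A_j]\}. && (14)
\end{aligned}$$

Assume $i < j$. Then, using the law of total expectation,

$$\mathbb{E}[A_i^\top A_j] = \mathbb{E}\left[A_i^\top \mathbb{E}[A_j \mid A_i, \ldots, A_1]\right] = 0. \tag{15}$$

Lemma 4 (Bounded Drift). For any local learning rate satisfying $\eta_l \le \frac{1}{2NKL}$, the client drift caused by local updates is bounded as

$$\sum_{i=1}^{N}\sum_{k=0}^{K-1} \mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right] \le \frac{2N^3K^2\eta_l^2\sigma^2 + 4N^3K^3\eta_l^2\left(B^2\|\nabla f(x^r)\|^2 + G^2\right)}{1 - 4N^2K^2\eta_l^2L^2}. \tag{16}$$

Proof. This proof is based on the proof of Theorem 1 (Convergence of Surrogate Objective) of Wang et al. (2020).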
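Considering

$$\begin{aligned}
x_i^{(r,k)} - x^r &= \left(x_i^{(r,k)} - x_i^{(r,0)}\right) + \left(x_i^{(r,0)} - x_{i-1}^{(r,0)}\right) + \cdots + \left(x_2^{(r,0)} - x_1^{(r,0)}\right) && (17) \\
&= -\eta_l \sum_{t=0}^{k-1} g_i(x_i^{(r,t)}) - \eta_l \sum_{s=1}^{i-1}\sum_{t=0}^{K-1} g_s(x_s^{(r,t)}), && (18)
\end{aligned}$$

we have

$$\begin{aligned}
\mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right] &= \eta_l^2\, \mathbb{E}\left[\left\|\sum_{t=0}^{k-1} g_i(x_i^{(r,t)}) + \sum_{s=1}^{i-1}\sum_{t=0}^{K-1} g_s(x_s^{(r,t)})\right\|^2\right] && (19) \\
&\le 2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{t=0}^{k-1} \left[g_i(x_i^{(r,t)}) - \nabla f_i(x_i^{(r,t)})\right] + \sum_{s=1}^{i-1}\sum_{t=0}^{K-1} \left[g_s(x_s^{(r,t)}) - \nabla f_s(x_s^{(r,t)})\right]\right\|^2\right] \\
&\quad + 2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{t=0}^{k-1} \nabla f_i(x_i^{(r,t)}) + \sum_{s=1}^{i-1}\sum_{t=0}^{K-1} \nabla f_s(x_s^{(r,t)})\right\|^2\right] && (20) \\
&\overset{(9)}{\le} 2i\eta_l^2\, \mathbb{E}\left[\left\|\sum_{t=0}^{k-1} \left[g_i(x_i^{(r,t)}) - \nabla f_i(x_i^{(r,t)})\right]\right\|^2 + \sum_{s=1}^{i-1}\left\|\sum_{t=0}^{K-1} \left[g_s(x_s^{(r,t)}) - \nabla f_s(x_s^{(r,t)})\right]\right\|^2\right] \\
&\quad + 2i\eta_l^2\, \mathbb{E}\left[\left\|\sum_{t=0}^{k-1} \nabla f_i(x_i^{(r,t)})\right\|^2 + \sum_{s=1}^{i-1}\left\|\sum_{t=0}^{K-1} \nabla f_s(x_s^{(r,t)})\right\|^2\right]. && (21)
\end{aligned}$$

Applying Lemma 3 to the first term on the right-hand side of Eq. (21) and Jensen's inequality to the second term, we get

$$\begin{aligned}
\mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right] &\le 2i^2K\eta_l^2\sigma^2 + 2iK\eta_l^2 \sum_{s=1}^{i}\sum_{t=0}^{K-1} \mathbb{E}\left[\left\|\nabla f_s(x_s^{(r,t)})\right\|^2\right] && (22) \\
&\overset{(7)}{\le} 2i^2K\eta_l^2\sigma^2 + 4iK\eta_l^2 \sum_{s=1}^{i}\sum_{t=0}^{K-1} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] + 4iK\eta_l^2 \sum_{s=1}^{i}\sum_{t=0}^{K-1} \mathbb{E}\left[\left\|\nabla f_s(x_s^{(r,t)}) - \nabla f_s(x^r)\right\|^2\right] && (23) \\
&\overset{\text{Asm. 1}}{\le} 2i^2K\eta_l^2\sigma^2 + 4iK\eta_l^2 \sum_{s=1}^{i}\sum_{t=0}^{K-1} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] + 4iK\eta_l^2L^2 \sum_{s=1}^{i}\sum_{t=0}^{K-1} \mathbb{E}\left[\left\|x_s^{(r,t)} - x^r\right\|^2\right] && (24) \\
&\le 2i^2K\eta_l^2\sigma^2 + 4iK^2\eta_l^2 \sum_{s=1}^{N} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] + 4iK\eta_l^2L^2 \underbrace{\sum_{s=1}^{N}\sum_{t=0}^{K-1} \mathbb{E}\left[\left\|x_s^{(r,t)} - x^r\right\|^2\right]}_{\mathcal{E}}. && (25)
\end{aligned}$$

Summing $\mathbb{E}[\|x_i^{(r,k)} - x^r\|^2]$ over $i$ and $k$, we get

$$\begin{aligned}
\sum_{i=1}^{N}\sum_{k=0}^{K-1} \mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right] &\le 2K^2\eta_l^2\sigma^2 \sum_{i=1}^{N} i^2 + 4K^3\eta_l^2 \sum_{s=1}^{N} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] \sum_{i=1}^{N} i + 4K^2\eta_l^2L^2\, \mathcal{E} \sum_{i=1}^{N} i && (26) \\
&\le 2N^3K^2\eta_l^2\sigma^2 + 4N^2K^3\eta_l^2 \sum_{s=1}^{N} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] + 4N^2K^2\eta_l^2L^2\, \mathcal{E}. && (27)
\end{aligned}$$

The term $\mathcal{E}$ is equal to $\sum_{i=1}^{N}\sum_{k=0}^{K-1} \mathbb{E}[\|x_i^{(r,k)} - x^r\|^2]$, so we can rearrange the inequality and get

$$\begin{aligned}
\left(1 - 4N^2K^2\eta_l^2L^2\right)\mathcal{E} &\le 2N^3K^2\eta_l^2\sigma^2 + 4N^2K^3\eta_l^2 \sum_{s=1}^{N} \mathbb{E}\left[\|\nabla f_s(x^r)\|^2\right] && (28) \\
&\overset{\text{Asm. 3}}{\le} 2N^3K^2\eta_l^2\sigma^2 + 4N^3K^3\eta_l^2 \left(B^2\|\nabla f(x^r)\|^2 + G^2\right). && (29)
\end{aligned}$$

Using $4N^2K^2\eta_l^2L^2 < 1$ and dividing both sides by $1 - 4N^2K^2\eta_l^2L^2$ yields the claim of Lemma 4.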
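For completeness, the step in the proof of Lemma 4 that replaces the sums over the client index $i$ by powers of $N$ uses two elementary bounds, valid for any integer $N \ge 1$:

```latex
\sum_{i=1}^{N} i^2 = \frac{N(N+1)(2N+1)}{6} \le N^3,
\qquad
\sum_{i=1}^{N} i = \frac{N(N+1)}{2} \le N^2 .
```

The first inequality is equivalent to $(4N+1)(N-1) \ge 0$ and the second to $N \ge 1$, so both hold with equality only at $N = 1$.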

D.2 PROOF OF THEOREM 1

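Proof. Beginning with Assumption 1, we have

$$f(x^{r+1}) - f(x^r) \le \left\langle \nabla f(x^r), x^{r+1} - x^r \right\rangle + \frac{L}{2}\left\|x^{r+1} - x^r\right\|^2. \tag{30}$$

From Algorithm 1, we know the global model update in round $r$ can be written as

$$x^{r+1} - x^r = \eta_g\left(x_N^{(r,K)} - x^r\right) = -\eta_g\eta_l \sum_{i=1}^{N}\sum_{k=0}^{K-1} g_i(x_i^{(r,k)}). \tag{31}$$

Taking the expectation conditioned on $x^r$, we get

$$\begin{aligned}
\mathbb{E}[f(x^{r+1})] - f(x^r) &\le \mathbb{E}\left[\left\langle \nabla f(x^r), -\eta_g\eta_l \sum_{i=1}^{N}\sum_{k=0}^{K-1} g_i(x_i^{(r,k)}) \right\rangle\right] + \frac{L}{2}\mathbb{E}\left[\left\|\eta_g\eta_l \sum_{i=1}^{N}\sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right] && (32) \\
&= -N\eta_g\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\langle \nabla f(x^r), \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)}) \right\rangle\right] + \frac{L}{2}\eta_g^2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right], && (33)
\end{aligned}$$

where we use $\mathbb{E}[g_i(x)] = \nabla f_i(x)$ in the equality (see Assumption 2). For the second term on the right-hand side (RHS) of Eq. (33), we have

$$\begin{aligned}
\mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right] &\overset{(7)}{\le} 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \left[g_i(x_i^{(r,k)}) - \nabla f_i(x_i^{(r,k)})\right]\right\|^2\right] + 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] && (34) \\
&\overset{(9),\,\text{Lem. 3}}{\le} 2\mathbb{E}\left[N\sum_{i=1}^{N}\sum_{k=0}^{K-1} \left\|g_i(x_i^{(r,k)}) - \nabla f_i(x_i^{(r,k)})\right\|^2\right] + 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] && (35) \\
&\overset{\text{Asm. 2}}{\le} 2N^2K\sigma^2 + 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right]. && (36)
\end{aligned}$$

Note that in Eq. (35) we apply Jensen's inequality before Lemma 3 to the first term of the RHS, since the data across clients are non-IID. If the data across clients are IID, we can get a tighter bound of $NK\sigma^2$. Then, plugging Eq. (36) into Eq. (33), we have

$$\begin{aligned}
\mathbb{E}[f(x^{r+1})] - f(x^r) &\le -N\eta_g\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\langle \nabla f(x^r), \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)}) \right\rangle\right] + L\eta_g^2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + N^2KL\eta_g^2\eta_l^2\sigma^2 && (37) \\
&= -\frac{N\eta_g\eta_l}{2} \sum_{k=0}^{K-1} \left[\|\nabla f(x^r)\|^2 + \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)})\right\|^2 - \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)}) - \nabla f(x^r)\right\|^2\right] \\
&\quad + L\eta_g^2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + N^2KL\eta_g^2\eta_l^2\sigma^2, && (38)
\end{aligned}$$

where we use the fact that $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$ in the last equation.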
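Note that

$$\begin{aligned}
&-\frac{1}{2}N\eta_g\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + L\eta_g^2\eta_l^2\, \mathbb{E}\left[\left\|\sum_{i=1}^{N}\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] \\
&\quad \overset{(9)}{\le} -\frac{1}{2}\frac{1}{N}\eta_g\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\|\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + L\eta_g^2\eta_l^2 K \sum_{k=0}^{K-1} \mathbb{E}\left[\left\|\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)})\right\|^2\right] \\
&\quad = -\frac{1}{2N}\eta_g\eta_l \left(1 - 2NKL\eta_g\eta_l\right) \sum_{k=0}^{K-1} \mathbb{E}\left[\left\|\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)})\right\|^2\right] && (39)
\end{aligned}$$

and

$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_i^{(r,k)}) - \nabla f(x^r)\right\|^2\right] &= \mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N} \left[\nabla f_i(x_i^{(r,k)}) - \nabla f_i(x^r)\right]\right\|^2\right] \\
&\overset{(11)}{\le} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left[\left\|\nabla f_i(x_i^{(r,k)}) - \nabla f_i(x^r)\right\|^2\right] \\
&\overset{\text{Asm. 1}}{\le} \frac{1}{N}\sum_{i=1}^{N} L^2\, \mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right], && (40)
\end{aligned}$$

where the first equality of Eq. (40) results from the fact that $\nabla f(x^r) = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x^r)$. By plugging Eq. (39) and Eq. (40) into Eq. (38) and using $2NK\eta_g\eta_lL \le 1$, we get

$$\mathbb{E}[f(x^{r+1})] - f(x^r) \le -\frac{1}{2}NK\eta_g\eta_l\|\nabla f(x^r)\|^2 + \frac{1}{2}L^2\eta_g\eta_l \sum_{i=1}^{N}\sum_{k=0}^{K-1} \mathbb{E}\left[\left\|x_i^{(r,k)} - x^r\right\|^2\right] + N^2KL\eta_g^2\eta_l^2\sigma^2. \tag{41}$$

Then we use Lemma 4:

$$\frac{\mathbb{E}[f(x^{r+1})] - f(x^r)}{NK\eta_g\eta_l} \le -\frac{1}{2}\left(1 - \frac{DB^2}{1-D}\right)\|\nabla f(x^r)\|^2 + N L\eta_g\eta_l\sigma^2 + \frac{1}{1-D}\left(N^2KL^2\eta_l^2\sigma^2 + 2N^2K^2L^2\eta_l^2G^2\right), \tag{42}$$

where $D = 4N^2K^2L^2\eta_l^2$. Then, using $D \le \frac{1}{2B^2+1}$ and $B \ge 1$, we get

$$\begin{aligned}
\frac{\mathbb{E}[f(x^{r+1})] - f(x^r)}{NK\eta_g\eta_l} &\le -\frac{1}{4}\|\nabla f(x^r)\|^2 + NL\eta_g\eta_l\sigma^2 + \left(1 + \frac{1}{2B^2}\right)\left(N^2KL^2\eta_l^2\sigma^2 + 2N^2K^2L^2\eta_l^2G^2\right) && (43) \\
&\le -\frac{1}{4}\|\nabla f(x^r)\|^2 + NL\eta_g\eta_l\sigma^2 + \frac{3}{2}N^2KL^2\eta_l^2\sigma^2 + 3N^2K^2L^2\eta_l^2G^2. && (44)
\end{aligned}$$

Taking the unconditional expectation, rearranging the terms and then averaging the above inequality over $r \in \{0, \ldots, R-1\}$, we have

$$\frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}\left[\|\nabla f(x^r)\|^2\right] \le \frac{4[f(x^0) - f(x^*)]}{NK\eta_g\eta_lR} + 12N^2K^2\eta_l^2L^2G^2 + 6N^2K\eta_l^2L^2\sigma^2 + 4N\eta_g\eta_lL\sigma^2.$$

Using the fact that $\mathbb{E}[\|\nabla f(\bar{x}^R)\|^2] \le \frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}[\|\nabla f(x^r)\|^2]$, where $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$, we get Eq. (5). Finally, we summarize the constraints:

$$\begin{aligned}
D = 4N^2K^2L^2\eta_l^2 &\le \frac{1}{2B^2+1}, && (45) \\
2NK\eta_g\eta_lL &\le 1, && (46) \\
2NKL\eta_l &\le 1, && (47)
\end{aligned}$$

where the last inequality is from Lemma 4.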
The overall constraint is given as

$$\eta_l \le \frac{1}{2NKL}\min\left\{\frac{1}{\sqrt{2B^2+1}}, \frac{1}{\eta_g}\right\}.$$

This completes the proof of Theorem 1.

Corollary 2. Choose $NK\eta_g\eta_l = \frac{1}{\sqrt{R}}$ and apply the result of Theorem 1. For sufficiently large $R$, it holds that

$$\mathbb{E}\left[\|\nabla f(\bar{x}^R)\|^2\right] \le O\left(\frac{F}{\sqrt{R}}\right) + O\left(\frac{KG^2 + \sigma^2}{KR}\right) + O\left(\frac{\sigma^2}{K\sqrt{R}}\right),$$

where $F := f(x^0) - f(x^*)$, $R$ is the total number of rounds, and $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$ is the averaged global model over the $R$ rounds.

Corollary 3. Apply the result of Theorem 1. There exists $\eta_l$ such that

$$\mathbb{E}\left[\|\nabla f(\bar{x}^R)\|^2\right] \le O\left(\frac{F\sqrt{2B^2+1}}{R+1}\right) + O\left(\frac{F^{2/3}\left(G^2 + \sigma^2/K\right)^{1/3}}{(R+1)^{2/3}}\right) + O\left(\frac{F\sigma^2}{\sqrt{K(R+1)}}\right),$$

where $F := f(x^0) - f(x^*)$, $R$ is the total number of rounds, and $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$ is the averaged global model over the $R$ rounds.

Proof. Applying Lemma 2 (sub-linear convergence rate) of Karimireddy et al. (2020) to Eq. (44), we get the claim of this corollary.
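As a sketch of how Corollary 2 follows from the bound of Theorem 1, assume $\eta_g = 1$ (as in Table 6) and substitute $NK\eta_l = 1/\sqrt{R}$ into each of the three terms, treating $L$ as a constant:

```latex
\begin{aligned}
\frac{4F}{NK\eta_g\eta_l R} &= \frac{4F}{\sqrt{R}}, \\
12N^2K^2\eta_l^2L^2G^2 + 6N^2K\eta_l^2L^2\sigma^2
  &= \frac{12L^2G^2}{R} + \frac{6L^2\sigma^2}{KR}
   = \frac{6L^2\left(2KG^2 + \sigma^2\right)}{KR}, \\
4N\eta_g\eta_lL\sigma^2 &= \frac{4L\sigma^2}{K\sqrt{R}} .
\end{aligned}
```

Under the same assumption, the step-size constraint of Theorem 1 becomes $\frac{1}{\sqrt{R}} \le \frac{1}{2L\sqrt{2B^2+1}}$, i.e. $R \ge 4L^2(2B^2+1)$, which is one way to read the "sufficiently large $R$" condition.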

D.3 PROOF OF THEOREM 2

Theorem 2. Let Assumptions 1, 2 and 3 hold. Suppose that the local learning rate satisfies $\eta_l \le \frac{1}{2KL}\min\left\{\frac{1}{\sqrt{2B^2+1}}, \frac{1}{\eta_g}\right\}$. For Algorithm 1, it holds that

$$\mathbb{E}\left[\|\nabla f(\bar{x}^R)\|^2\right] \le \underbrace{\frac{4[f(x^0) - f(x^*)]}{K\eta_g\eta_lR}}_{T_1:\ \text{initialization error}} + \underbrace{12K(K-1)\eta_l^2L^2G^2 + 6(K-1)\eta_l^2L^2\sigma^2}_{T_2:\ \text{client drift error}} + \underbrace{\frac{4\eta_g\eta_lL\sigma^2}{N}}_{T_3:\ \text{global variance}}, \tag{51}$$

where $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$ is the averaged global model over the $R$ rounds.

Proof. This is almost the same as Theorem 1 of Wang et al. (2020); we reproduce it here for convenience of comparison between FL and SL. The proof is similar to that of Theorem 1.
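To compare the two guarantees numerically, the following sketch evaluates the right-hand sides of Theorem 1 (SL) and Theorem 2 (FL) term by term, together with the corresponding step-size constraints. It is only a calculator for the stated formulas; the function names and the example constants are hypothetical and not part of the paper.

```python
import math
from typing import Tuple


def sl_max_local_lr(N: int, K: int, L: float, B: float, eta_g: float = 1.0) -> float:
    """Largest eta_l allowed by Theorem 1 (SL)."""
    return min(1.0 / math.sqrt(2 * B * B + 1), 1.0 / eta_g) / (2 * N * K * L)


def fl_max_local_lr(K: int, L: float, B: float, eta_g: float = 1.0) -> float:
    """Largest eta_l allowed by Theorem 2 (FL)."""
    return min(1.0 / math.sqrt(2 * B * B + 1), 1.0 / eta_g) / (2 * K * L)


def sl_bound(F: float, N: int, K: int, R: int, L: float, G: float,
             sigma: float, eta_l: float, eta_g: float = 1.0) -> Tuple[float, float, float]:
    """(initialization, client drift, variance) terms of Theorem 1."""
    init = 4 * F / (N * K * eta_g * eta_l * R)
    drift = (12 * N**2 * K**2 * eta_l**2 * L**2 * G**2
             + 6 * N**2 * K * eta_l**2 * L**2 * sigma**2)
    var = 4 * N * eta_g * eta_l * L * sigma**2
    return init, drift, var


def fl_bound(F: float, N: int, K: int, R: int, L: float, G: float,
             sigma: float, eta_l: float, eta_g: float = 1.0) -> Tuple[float, float, float]:
    """(initialization, client drift, variance) terms of Theorem 2."""
    init = 4 * F / (K * eta_g * eta_l * R)
    drift = (12 * K * (K - 1) * eta_l**2 * L**2 * G**2
             + 6 * (K - 1) * eta_l**2 * L**2 * sigma**2)
    var = 4 * eta_g * eta_l * L * sigma**2 / N
    return init, drift, var


if __name__ == "__main__":
    # Hypothetical constants chosen only for illustration.
    N, K, R, L, B, G, sigma, F = 10, 5, 1000, 1.0, 1.0, 1.0, 1.0, 1.0
    print("SL:", sl_bound(F, N, K, R, L, G, sigma, sl_max_local_lr(N, K, L, B)))
    print("FL:", fl_bound(F, N, K, R, L, G, sigma, fl_max_local_lr(K, L, B)))
```

With these formulas the admissible local learning rate of SL is $N$ times smaller than that of FL for the same $K$, $L$, $B$ and $\eta_g$, and the FL drift term vanishes when $K = 1$.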

D.4 PROOF OF THEOREM 3

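Here we use Assumption 4 to replace Assumption 3. Assumption 4 is a stronger assumption than Assumption 3 and is used in Lian et al. (2017).

Assumption 4. There exists a constant $G \ge 0$ such that

$$\mathbb{E}_{i \sim U([N])}\left[\|\nabla f_i(x) - \nabla f(x)\|^2\right] \le G^2, \tag{52}$$

where $i$ is uniformly sampled from $\{1, \ldots, N\}$. In the IID case, $G = 0$.

Theorem 3 (Progress of one client in one round). Let Assumptions 1, 2 and 4 hold. Suppose that the local learning rate satisfies $\eta_l \le \frac{1}{2\sqrt{5}KL}$. For Algorithm 1, it holds that

$$\frac{1}{NR}\sum_{r=0}^{R-1}\sum_{i=1}^{N} \mathbb{E}\left[\left\|\nabla f(x_i^{(r,0)})\right\|^2\right] \le \frac{4[f(x^0) - f(x^*)]}{NK\eta_lR} + 40K^2L^2\eta_l^2G^2 + 10KL^2\eta_l^2\sigma^2 + 4L\eta_l\sigma^2 + 4G^2, \tag{53}$$

where $\bar{x}^R = \frac{1}{R}\sum_{r=0}^{R-1} x^r$ is the averaged global model over the $R$ rounds.

Proof. Different from Theorem 1, we bound the progress of one client in one round. Beginning with Assumption 1, we have

$$\mathbb{E}\left[f(x_{i+1}^{(r,0)}) - f(x_i^{(r,0)})\right] \le \mathbb{E}\left[\left\langle \nabla f(x_i^{(r,0)}), x_{i+1}^{(r,0)} - x_i^{(r,0)} \right\rangle\right] + \frac{L}{2}\mathbb{E}\left[\left\|x_{i+1}^{(r,0)} - x_i^{(r,0)}\right\|^2\right]. \tag{54}$$

From Algorithm 1, we know the local update of client $i$ in round $r$ can be written as

$$x_{i+1}^{(r,0)} - x_i^{(r,0)} = -\eta_l \sum_{k=0}^{K-1} g_i(x_i^{(r,k)}). \tag{55}$$

Taking the expectation conditioned on $x_i^{(r,0)}$, we get

$$\begin{aligned}
\mathbb{E}\left[f(x_{i+1}^{(r,0)}) - f(x_i^{(r,0)})\right] &\le \mathbb{E}\left[\left\langle \nabla f(x_i^{(r,0)}), -\eta_l \sum_{k=0}^{K-1} g_i(x_i^{(r,k)}) \right\rangle\right] + \frac{L}{2}\mathbb{E}\left[\left\|\eta_l \sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right] && (56) \\
&= -\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\langle \nabla f(x_i^{(r,0)}), \nabla f_i(x_i^{(r,k)}) \right\rangle\right] + \frac{L}{2}\eta_l^2\, \mathbb{E}\left[\left\|\sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right], && (57)
\end{aligned}$$

where we use $\mathbb{E}[g_i(x)] = \nabla f_i(x)$ in the equality (see Assumption 2). For the second term on the right-hand side (RHS) of Eq. (57), we have

$$\begin{aligned}
\mathbb{E}\left[\left\|\sum_{k=0}^{K-1} g_i(x_i^{(r,k)})\right\|^2\right] &\overset{(7)}{\le} 2\mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \left[g_i(x_i^{(r,k)}) - \nabla f_i(x_i^{(r,k)})\right]\right\|^2\right] + 2\mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] && (58) \\
&\overset{\text{Lem. 3}}{\le} 2\mathbb{E}\left[\sum_{k=0}^{K-1} \left\|g_i(x_i^{(r,k)}) - \nabla f_i(x_i^{(r,k)})\right\|^2\right] + 2\mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] && (59) \\
&\overset{\text{Asm. 2}}{\le} 2K\sigma^2 + 2\mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right]. && (60)
\end{aligned}$$

Then, plugging Eq. (60) into Eq. (57), we have

$$\begin{aligned}
\mathbb{E}\left[f(x_{i+1}^{(r,0)}) - f(x_i^{(r,0)})\right] &\le -\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left[\left\langle \nabla f(x_i^{(r,0)}), \nabla f_i(x_i^{(r,k)}) \right\rangle\right] + L\eta_l^2\, \mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + KL\eta_l^2\sigma^2 && (61) \\
&= -\frac{\eta_l}{2}\sum_{k=0}^{K-1} \left[\left\|\nabla f(x_i^{(r,0)})\right\|^2 + \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)})\right\|^2 - \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)}) - \nabla f(x_i^{(r,0)})\right\|^2\right] \\
&\quad + L\eta_l^2\, \mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] + KL\eta_l^2\sigma^2, && (62)
\end{aligned}$$

where we use the fact that $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$ in the last equation. Note that

$$\begin{aligned}
&-\frac{1}{2}\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)})\right\|^2 + L\eta_l^2\, \mathbb{E}\left[\left\|\sum_{k=0}^{K-1} \nabla f_i(x_i^{(r,k)})\right\|^2\right] \\
&\quad \overset{(9)}{\le} -\frac{1}{2}\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)})\right\|^2 + L\eta_l^2K \sum_{k=0}^{K-1} \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)})\right\|^2 \\
&\quad = -\frac{1}{2}\eta_l\left(1 - 2KL\eta_l\right) \sum_{k=0}^{K-1} \mathbb{E}\left\|\nabla f_i(x_i^{(r,k)})\right\|^2 && (63)
\end{aligned}$$

and

$$\begin{aligned}
\mathbb{E}\left\|\nabla f_i(x_i^{(r,k)}) - \nabla f(x_i^{(r,0)})\right\|^2 &\overset{(7)}{\le} 2\mathbb{E}\left\|\nabla f_i(x_i^{(r,k)}) - \nabla f_i(x_i^{(r,0)})\right\|^2 + 2\mathbb{E}\left\|\nabla f_i(x_i^{(r,0)}) - \nabla f(x_i^{(r,0)})\right\|^2 && (64) \\
&\overset{\text{Asm. 1, 4}}{\le} 2L^2\, \mathbb{E}\left\|x_i^{(r,k)} - x_i^{(r,0)}\right\|^2 + 2G^2. && (65)
\end{aligned}$$

By plugging Eq. (63) and Eq. (65) into Eq. (62) and using $2K\eta_lL \le 1$, we get

$$\mathbb{E}\left[f(x_{i+1}^{(r,0)}) - f(x_i^{(r,0)})\right] \le -\frac{1}{2}K\eta_l\left\|\nabla f(x_i^{(r,0)})\right\|^2 + L^2\eta_l \sum_{k=0}^{K-1} \mathbb{E}\left\|x_i^{(r,k)} - x_i^{(r,0)}\right\|^2 + K\eta_lG^2 + KL\eta_l^2\sigma^2. \tag{66}$$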

E MORE EXPERIMENTAL DETAILS

Platform. We train LeNet-5 on MNIST and Fashion-MNIST with an Nvidia GeForce RTX 3070 Ti, and VGG-11 on CIFAR-10 with an Nvidia 3090 Ti. The algorithms are implemented in PyTorch. We use the random seed "1234" by default. We use the vanilla SGD algorithm with momentum = 0.9 and weight decay = 1e-4, following He et al. (2020). The detailed information about the models and other settings can be found in our code.

E.1 MORE RESULTS OF SL

This section is complementary to Section 5.2 and studies the factors that affect the performance of SL. We report the details of the experiments, such as $\eta_l$, $\eta_g$ and $b$. Effect of data heterogeneity. Effect of $\eta_g$. The experimental results on Fashion-MNIST and CIFAR-10 are shown in Figure 7. We can see that $\eta_g$ can be helpful in some cases, especially when $\eta_l$ is small. However, there is still a big gap between theory and practice. Further research is required.

E.2 MORE COMPARISONS BETWEEN FL AND SL IN CROSS-DEVICE SETTING

The learning rates of FL and SL are selected from {0.0005, 0.001, 0.005, 0.01, 0.05, 0.1}. We report the overall results of the comparisons between FL and SL on the MNIST, Fashion-MNIST and CIFAR-10 datasets with different learning rates in Table 7, Table 8 and Table 9 respectively. Table 2 in the main body is based on these three tables. "L-M", "L-M" and "V-10" denote LeNet-5 on MNIST, LeNet-5 on Fashion-MNIST and VGG-11 on CIFAR-10 respectively. We highlight the "best" test accuracy among all chosen learning rates in blue for FL and in red for SL. We underline the test accuracy of the "threshold" learning rate in blue for FL and in red for SL.

Table 7: The detailed results of FL and SL with different learning rates on the MNIST dataset. We average the test accuracy over the last 100 rounds of 1000 total rounds when E = 1, and over the last 10 rounds of 100 total rounds when E = 10.



We only show the convergence for non-convex objective functions here, since SL is now often used in large deep learning models whose objective functions are possibly non-convex. Nevertheless, similar methods can be used to get the convergence for general convex functions and strongly convex functions. FedAvg is used for comparison in this work. The client and the server cooperate to conduct the local updates. Note that in SL, though the model update requires communication between the client and the server, the model is still trained on the local dataset, so the process is called a local update (Thapa et al., 2020). The concatenation of the client-side and server-side models after each local update is called the local model. The security and privacy issue is beyond the scope of this work.



Figure 1: Illustration of the model updates of FL and SL for 2 clients and 2 local update steps during one round.

Request the initial client-side model $x_c^r$ of the current round

Data distribution. Both IID and non-IID datasets are considered in the experiments. For the non-IID setting, we adopt two mechanisms to generate non-IID data: (i) using a Dirichlet distribution Dir(α) to generate mildly non-IID data, where a smaller α indicates higher data heterogeneity (Hsu et al., 2019; Zhu et al., 2021); (ii) using a mechanism similar to (McMahan et al., 2017; Zhao et al., 2018) to generate pathological non-IID data, e.g., one distribution in which most clients only contain samples from 2 (5) classes, denoted as C = 2 (5). Note that the data distribution generated by the first mechanism is unbalanced and that of the second is balanced. We will use "α = #" and "C = #" to represent the different data distributions in the following, where # is the parameter.

5.2 EXPERIMENTAL RESULTS OF SL

Figure 2: Results of SL on MNIST dataset.

In this section, we study the effects of data heterogeneity and local update steps using the MNIST dataset. The training samples are assigned to 10 clients. For data heterogeneity, we use four distributions: IID, C = 8, 5, 2. For $K$, we use five settings, E = 1, 2, 4, 8, 10, over the $\mathrm{Dir}_{10}(1.0)$ distribution. E is the number of local epochs (McMahan et al., 2017), which satisfies $K = \max\{E n_i / b\}$, where $b$ is the mini-batch size. So E can measure the value of $K$ with $n_i$ and $b$ fixed. The results on Fashion-MNIST and CIFAR-10 are deferred to Appendix E.1.
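The two non-IID partition mechanisms used in the experiments (Dirichlet and pathological "C = #") can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' code: the Dirichlet variant below splits each class across clients with proportions drawn from Dir(α), which is one common variant, and the pathological variant follows the sort-and-shard idea of McMahan et al. (2017). The function names and the default seed are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        alpha: float, seed: int = 1234) -> List[List[int]]:
    """Split every class across clients with proportions ~ Dir(alpha).

    Smaller alpha gives more skewed (and unbalanced) client datasets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of normalized Gamma(alpha, 1) draws.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights) or 1.0
        start, acc = 0, 0.0
        for c, w in enumerate(weights):
            acc += w
            end = len(idxs) if c == num_clients - 1 else round(acc / total * len(idxs))
            end = min(max(end, start), len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


def shard_partition(labels: Sequence[int], num_clients: int,
                    classes_per_client: int, seed: int = 1234) -> List[List[int]]:
    """Sort by label, cut into equal shards, give each client C shards.

    Client datasets are balanced in size; leftover samples are dropped.
    """
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * classes_per_client
    size = len(order) // num_shards
    shards = [order[s * size:(s + 1) * size] for s in range(num_shards)]
    rng.shuffle(shards)
    return [
        [i for shard in shards[c * classes_per_client:(c + 1) * classes_per_client]
         for i in shard]
        for c in range(num_clients)
    ]


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]  # 10 classes, 100 samples each
    print([len(p) for p in dirichlet_partition(toy_labels, 10, alpha=1.0)])
    print([len(p) for p in shard_partition(toy_labels, 10, classes_per_client=2)])
```

The Dirichlet split assigns every sample exactly once but yields clients of different sizes, whereas the shard split yields equally sized clients that each see only a few classes, matching the unbalanced and balanced behaviour described for the two mechanisms.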

Figure 4: Two modes of SL. The peer-to-peer mode is illustrated in the left side; the centralized mode is illustrated in the right side.

Figure 5: Effect of data heterogeneity. (a) MNIST, N = 10, b = 1000, η l = 0.01, η g = 1.0; (b) Fashion-MNIST, N = 10, b = 1000, η l = 0.01, η g = 1.0; (c) CIFAR-10, N = 10, b = 100, η l = 0.001, η g = 1.0. Effect of K.

Figure 7: Test accuracy for various combinations of local and global learning rates. $\mathrm{Dir}_{10}(10.0)$ is used. (a) For MNIST, N = 10, b = 1000, E = 1; we average the test accuracy over the last 10 rounds of 30 total rounds. (b) For Fashion-MNIST, N = 10, b = 1000, E = 1; we average the test accuracy over the last 10 rounds of 30 total rounds. (c) For CIFAR-10, N = 10, b = 100, E = 1; we average the test accuracy over the last 20 rounds of 100 total rounds.

The average top-1 test accuracy over the last 10% of rounds is shown in Table 2, e.g., the average accuracy over the last 400 rounds of the total 4000 rounds on CIFAR-10 when E = 1.

Cross-device setting. We compare the performance of FL and SL in cross-device settings, where the number of clients is enormous and the local dataset size is small. Specifically, (i) MNIST: the training data is split among 1000 clients; (ii) Fashion-MNIST: the training data is split among 1000 clients; (iii) CIFAR-10: the training data is split among 500 clients. The number of samples per client depends on the data distribution, e.g., 60 (MNIST), 60 (Fashion-MNIST) and 100 (CIFAR-10) samples per client in the second partition mechanism. The mini-batch size is 10 for all setups, in consideration of the low computation power of IoT devices. The original test sets are used to evaluate the generalization performance of the global model (test accuracy) after each training round.
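The reported metric, the test accuracy averaged over the last 10% of rounds, can be computed with a small helper such as the one below. The function name `tail_average` and its rounding rule are assumptions made for this sketch.

```python
from typing import Sequence


def tail_average(values: Sequence[float], fraction: float = 0.1) -> float:
    """Average of the last `fraction` of the per-round values."""
    if not values:
        raise ValueError("values must be non-empty")
    # Keep at least one round; e.g. 4000 rounds with fraction 0.1 keeps 400.
    n = max(1, round(len(values) * fraction))
    tail = values[-n:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    per_round_accuracy = [r / 100 for r in range(100)]  # made-up curve
    print(tail_average(per_round_accuracy))
```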

In this work, we first derived the convergence guarantee of SL for non-convex objectives on non-IID data. The results reveal that the convergence of SL is affected by factors such as data heterogeneity and local update steps. Furthermore, we compared SL against FL theoretically and empirically, and concluded that (i) the best and threshold learning rates of SL are smaller than those of FL; (ii) the performance of SL is worse than that of FL when the number of local update steps is large on highly non-IID data; (iii) the performance of SL can be better than that of FL when the number of local update steps is small on highly non-IID data. Our work can bridge the gap between FL and SL, provide a deeper understanding of these two approaches and guide their deployment in real-world applications.

Assumptions
4.2 Convergence Result and Discussion
4.3 Comparison between FL and SL
More comparisons between FL and SL in cross-device setting

A MORE DETAILS ABOUT SPLIT LEARNING

In this section, we provide more details about SL. More discussions can be found in Gupta & Raskar (2018); Thapa et al. (2021); Duan et al. (2022).

Summary of notations appearing in the paper.

Communication cost of FL and SL when running E local epochs for N clients. $x$ and $x_c$ are the sizes of the complete model parameters and the client-side model parameters respectively. $n$ and $n_i$ are the sizes of the overall dataset $D$ and the local dataset $D_i$ respectively. The size of the smashed layer is denoted as $q$. SL with global update needs the additional communication cost of one client-side model per round.

Variants of SL. SL is deemed a promising paradigm for distributed model training on resource-constrained devices, given its computational efficiency on the client side. Most existing works focus on reducing the training delay arising from the relay-based training manner in the multi-user scenario. SplitFed (Thapa et al., 2020) is one popular model-parallel algorithm that combines the strengths of FL and SL, where each client has one corresponding instance of the server-side model in the main server to form a pair. Each pair constitutes a complete model and conducts the local update in parallel. After each training round, the fed server collects and aggregates the client-side local updates. The aggregated client-side model is disseminated to all the clients before the next round. The main server performs the same operations on the instances of the server-side model. However, in SplitFed, the main server requires substantial computing power to support the multiple instances of the server-side model. To address this issue, Thapa et al. (2020) proposed SplitFedv2. The only change in SplitFedv2 is that there is only one server-side model in the main server, so the server-side model has to process the activations (or smashed data) of the client-side models in sequence. The model gets updated in every single forward-backward propagation, which means that the server-side model makes more updates than the client-side model. The main point of SplitFedv3 (Gawali et al., 2021) is replacing alternative client training with alternative mini-batch training.
In vanilla SL, the local model makes one or more training passes over the local dataset (alternative client train-

The update rules of Minibatch SGD, FL (Local SGD) and SL. We use the descriptions of Local SGD and Minibatch SGD in Woodworth et al. (2020b) and notations in Section 2. ξ


Then, using the bounded client drift in Wang et al. (2020), we can bound the drift term $\sum_{k=0}^{K-1} \mathbb{E}\|x_i^{(r,k)} - x_i^{(r,0)}\|^2$ in terms of $D = 4K^2L^2\eta_l^2$. Using $D \le \frac{1}{5}$ (so that $\frac{1}{1-D} \le \frac{5}{4}$) and $B \ge 1$, then taking the unconditional expectation, rearranging the terms and averaging, we get Eq. (53). Finally, the constraints $2K\eta_lL \le 1$ and $D \le \frac{1}{5}$ together give the overall constraint $\eta_l \le \frac{1}{2\sqrt{5}KL}$ stated in Theorem 3. This completes the proof of Theorem 3.

D.5 EXTREME CASES

Theorem 1 recovers the convergence of SGD when $N = 1$ and $K = 1$. Let us focus on the proof of Lemma 4. When $N = 1$ and $K = 1$, the client drift reduces to

$$\mathbb{E}\left[\left\|x_1^{(r,0)} - x^r\right\|^2\right] = 0,$$

where $x_1^{(r,0)} = x^r$ (see Algorithm 1). Thus the client drift error of Eq. (5) is removed, which recovers the result of SGD (Bottou et al., 2018).

Table 9: The detailed results of FL and SL with different learning rates on the CIFAR-10 dataset. We average the test accuracy over the last 400 rounds of 4000 total rounds when E = 1, and over the last 40 rounds of 400 total rounds when E = 10. We do not execute the experiments whose learning rates are larger than the "threshold" learning rate ("-" in the table).

