MIXED FEDERATED LEARNING: JOINT DECENTRALIZED AND CENTRALIZED LEARNING

Abstract

Federated learning (FL) enables learning from decentralized privacy-sensitive data, with computations on raw data confined to take place at edge clients. This paper introduces mixed FL, which incorporates an additional loss term calculated at the coordinating server (while maintaining FL's private data restrictions). For example, additional datacenter data can be leveraged to jointly learn from centralized (datacenter) and decentralized (federated) training data and better match an expected inference data distribution. Mixed FL also enables offloading some intensive computations (e.g., embedding regularization) to the server, greatly reducing communication and client computation load. For these and other mixed FL use cases, we present three algorithms: PARALLEL TRAINING, 1-WAY GRADIENT TRANSFER, and 2-WAY GRADIENT TRANSFER. We perform extensive experiments on the algorithms across three tasks, demonstrating that mixed FL can blend training data to achieve an oracle's accuracy on an inference distribution, and can reduce communication and computation overhead by more than 90%. Finally, we state convergence bounds for all algorithms, and give intuition on the mixed FL problems best suited to each. The theory confirms our empirical observations of how the algorithms perform under different mixed FL problem settings.

Under review as a conference paper at ICLR 2023

The centralized loss f_c will differ from the federated loss f_f (else it would not be useful). This can be a difference in the distributions that B_c and B_i are drawn from, and/or in the functional forms of f_c and f_i. We present an expression that quantifies the difference between f_c and f_f in Section 5. We now state our mixed FL algorithms (Algorithms 1 and 2); Appendix A gives a few practical details.

• PARALLEL TRAINING performs a round of FEDAVG (minimizing f_f) in parallel with steps of centralized training (minimizing f_c), merges (e.g., averages) the respective updates, and repeats. Green in Algorithm 1 indicates steps added beyond FEDAVG for PARALLEL TRAINING.
• 1-WAY GRADIENT TRANSFER starts each round by calculating a gradient of f_c. It is sent to participating clients and summed with clients' gradients of f_i during client optimization. Blue in Algorithm 2 indicates steps added beyond FEDAVG for 1-WAY GRADIENT TRANSFER.
• 2-WAY GRADIENT TRANSFER is PARALLEL TRAINING with gradient sharing. Two gradients are now used: one based on f_c and sent to clients (as in 1-W GT), and one based on f_f and applied centrally. Purple in Algorithm 1 indicates steps added beyond PARALLEL TRAINING for 2-WAY GRADIENT TRANSFER.

Algorithm 1: PARALLEL TRAINING and 2-WAY GRADIENT TRANSFER (FEDAVG with added steps for PARALLEL TRAINING and further steps for 2-WAY GRADIENT TRANSFER)
Input: Initial model x^(0); CLIENTOPTIMIZER, SERVEROPTIMIZER, CENTRALOPTIMIZER, MERGEOPTIMIZER with respective learning rates η, η_s, η_c, η_m; central loss function f_c (3); initial augmenting centralized/federated gradients g_c^(0), g_f^(0) (zeroed)
for t ∈ {0, 1, ..., T − 1} do
    Initialize central model x_c^(t,0) = x^(t)
    for central step k = 0, ..., K_c − 1 do
        Sample centralized batch B_c^(k); compute stochastic gradient g_c(x_c^(t,k); B_c^(k)) of f_c(x_c^(t,k))
        Update central model x_c^(t,k+1) = CENTRALOPTIMIZER(x_c^(t,k), g_c(x_c^(t,k); B_c^(k)) + g_f^(t), η_c, t)
    Compute central model delta Δ_c^(t) = x_c^(t,K_c) − x^(t)
    Sample a subset S^(t) of clients
    for client i ∈ S^(t) in parallel do
        Δ_i^(t), p_i = CLIENTUPDATE(x^(t), g_c^(t), CLIENTOPTIMIZER, η, t)
    Aggregate client changes Δ^(t) = Σ_{i∈S^(t)} p_i Δ_i^(t) / Σ_{i∈S^(t)} p_i
    Compute federated model x_f^(t)
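The round structure of PARALLEL TRAINING can be sketched as follows. This is a minimal NumPy simulation, not the paper's implementation: all four optimizers are assumed to be plain SGD, client weights are assumed equal, and gradient sharing is omitted (so the augmenting gradients g_c, g_f are zero). Function and parameter names are illustrative.

```python
import numpy as np

def sgd(x, grad, lr):
    """One SGD step; stands in for the pluggable optimizers in Algorithm 1."""
    return x - lr * grad

def parallel_training_round(x, client_grads, central_grad, eta=0.1, eta_s=1.0,
                            eta_c=0.1, eta_m=1.0, K=5, Kc=5):
    """One PARALLEL TRAINING round (no gradient sharing).

    client_grads: list of callables, each returning a client's stochastic
                  gradient of f_i at a given model.
    central_grad: callable returning a stochastic gradient of f_c.
    """
    # Centralized branch: Kc steps starting from the global model.
    xc = x.copy()
    for _ in range(Kc):
        xc = sgd(xc, central_grad(xc), eta_c)
    delta_c = xc - x

    # Federated branch: each client runs K local steps from the global model.
    deltas = []
    for g_i in client_grads:
        xi = x.copy()
        for _ in range(K):
            xi = sgd(xi, g_i(xi), eta)
        deltas.append(xi - x)
    delta_avg = np.mean(deltas, axis=0)   # equal client weights assumed
    xf = sgd(x, -delta_avg, eta_s)        # SERVEROPTIMIZER applied to -Δ
    delta_f = xf - x

    # Merge the two branches' deltas into the next global model.
    return sgd(x, -(delta_c + delta_f), eta_m)  # MERGEOPTIMIZER
```

With symmetric quadratic losses (federated loss centered at +1, centralized loss at −1) the merged update balances the two pulls, illustrating how the branches jointly steer the global model.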

1. INTRODUCTION

Federated learning (FL) (McMahan et al., 2017) is a machine learning setting where multiple 'clients' (e.g., mobile phones) collaborate to train a model under the coordination of a central server. Clients' raw data are never transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective (Kairouz et al., 2019). FL typically delivers model quality improvements because training examples gathered in situ by clients reflect actual inference serving requests. For example, a mobile keyboard next-word prediction model can be trained from actual SMS messages, yielding higher accuracy than a model trained on a proxy document corpus. Because of these benefits, FL has been used to train production models for many applications (Hard et al., 2018; Ramaswamy et al., 2019; Apple, 2019; Ramaswamy et al., 2020; Hartmann, 2021; Hard et al., 2022).

Building on FL, we can gain significant benefits from 'mixed FL': jointly training with an additional centralized objective in conjunction with the decentralized objective of FL. Let x be model parameters to be optimized. Let f denote a mixed loss, a sum of a federated loss f_f and a centralized loss f_c:

f(x) = f_f(x) + f_c(x)    (1)

The mixed loss f might be a more useful training objective than f_f alone for many reasons, including:

Mitigating Distribution Shift by Adding Centralized Data to FL  While FL helps reduce train vs. inference distribution skew, it may not remove it completely. Examples include: training device populations that are subsets of inference device populations (e.g., training on high-end phones, for eventual use also on low-end phones), label-biased example retention on edge clients (e.g., only retaining positive examples of a binary classification task), and infrequent safety-critical example events with outsized importance (e.g., automotive hard-braking events needed to train a self-driving AI) (Anonymous, a).
The benefits of FL can be achieved while overcoming remaining distribution skew by incorporating data from an additional datacenter dataset via mixed FL. This affords a composite set of training data that better matches the inference distribution.

Reducing Client Computation and Communication  In representation learning, negative examples are used to push dissimilar items apart in a latent embedding space while keeping positive examples closer together (Oord et al., 2018). In federated settings, clients' caches may have limited local negative examples, and recent work (Anonymous, b) showed this significantly degrades performance compared to centralized learning. This work also showed that using an additional loss (a regularization) to push representations apart, instead of negative examples, can resolve this performance gap. However, done naively this requires communicating and computing over a large embedding table, introducing massive overhead for large-scale tasks. Applying mixed FL, where the federated loss f_f is the primary 'affinity' loss and the centralized loss f_c is the 'spreadout' regularization, avoids communicating the entire embedding table to clients and greatly reduces client computation.

Though mixed FL can clearly be useful, an actual process to minimize f is not trivial. FL requires that clients' data stay on device, as they contain private information that could reveal personal identity. Moreover, the centralized loss/data is expected to differ significantly from the client loss/data.
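The effect of mixing the two objectives can be seen in a toy example. The sketch below uses hypothetical 1-D quadratic losses (the means mu_f and mu_c are illustrative, not from the paper) to show that the minimizer of f = f_f + f_c is a compromise between the federated and centralized minimizers, which is exactly the blending behavior mixed FL seeks.

```python
import numpy as np

# Hypothetical 1-D quadratics: f_f is centered on the federated data mean,
# f_c on the centralized data mean (illustrative values).
mu_f, mu_c = 1.0, -1.0

f_f = lambda x: 0.5 * (x - mu_f) ** 2   # federated loss
f_c = lambda x: 0.5 * (x - mu_c) ** 2   # centralized loss
f = lambda x: f_f(x) + f_c(x)           # mixed loss, Equation 1

# Grid search for the mixed minimizer: it lands at the midpoint of the
# two means, a blend of both data distributions.
xs = np.linspace(-2.0, 2.0, 4001)
x_star = xs[np.argmin(f(xs))]
```

Minimizing f_f alone would recover mu_f and ignore the centralized distribution entirely; the mixed objective pulls the model toward both.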

Contributions

• We motivate the mixed FL problem and present three algorithms for addressing it: PARALLEL TRAINING (PT), 1-WAY GRADIENT TRANSFER (1-W GT), and 2-WAY GRADIENT TRANSFER (2-W GT). These algorithms maintain the data privacy protections inherent in FL. [Section 2]
• We experiment with facial attribute classification and language modeling, demonstrating that our algorithms overcome distribution shift. We match the accuracy of hypothetical 'oracle' scenarios where the entire inference distribution was colocated for training. [Section 4]
• We experiment with user-embedding based movie recommendation, reducing communication overhead by 93.9% and client computation by 99.9% with no degradation in quality. [Section 4]
• We state convergence bounds for each algorithm (in strongly convex, general convex, and non-convex settings), providing theoretical explanations for the convergence behaviors we observe in experiments. This indicates how the algorithms will perform on new mixed FL tasks. [Section 5]

2. ALGORITHMS

In FL, the loss function f_f is an average of client loss functions f_i. The client loss f_i is an expectation over batches of data examples B_i on client i:

f_f(x) = (1/N) Σ_{i=1}^{N} f_i(x),    f_i(x) = E_{B_i}[f_i(x; B_i)]    (2)

FEDAVG (McMahan et al., 2017) is a ubiquitous, heuristic FL method designed to minimize Equation 2 w.r.t. model x in a manner that allows all client data (B_i) to remain at the respective clients i. Providing strong privacy protection is a major motivation for FL. Storing raw data locally on clients rather than replicating it on servers decreases the attack surface of the system. Also, using focused ephemeral updates and early aggregation follows principles of data minimization (White House Report, 2013).

While training with loss f_f via FEDAVG can yield an effective model x, this paper shows there are scenarios where 'mixing' in an additional 'centralized' loss f_c proves beneficial to the training of x. Such a loss term can make use of batches of centralized data examples B_c from a datacenter dataset:

f_c(x) = E_{B_c}[f_c(x; B_c)]    (3)

Algorithm 1 (continued):
    Compute federated model x_f^(t) = SERVEROPTIMIZER(x^(t), −Δ^(t), η_s, t)
    Compute federated model delta Δ_f^(t) = x_f^(t) − x^(t)
    Aggregate central model and federated model deltas Δ^(t) = Δ_c^(t) + Δ_f^(t)
    Update global model x^(t+1) = MERGEOPTIMIZER(x^(t), −Δ^(t), η_m, t)
    Update augmenting centralized and federated gradients g_c^(t+1), g_f^(t+1) (2-WAY GRADIENT TRANSFER; see Appendix A.2)

CLIENTUPDATE:

Input: Initial client model x_i^(t,0); (possible) augmenting gradient g_c^(t); CLIENTOPTIMIZER; learning rate η; round t; client loss f_i (2)
Initialize client weight p_i = 0
for client step k = 0, ..., K_i − 1 do
    Sample batch B_i^(k); compute stochastic gradient g_i(x_i^(t,k); B_i^(k)) of f_i(x_i^(t,k))
    Update client weight p_i = p_i + |B_i^(k)|
    Perform client update x_i^(t,k+1) = CLIENTOPTIMIZER(x_i^(t,k), g_i(x_i^(t,k); B_i^(k)) + g_c^(t), η, t)
Compute client model change Δ_i^(t) = x_i^(t,K_i) − x_i^(t,0) and return Δ_i^(t), p_i

Algorithm 2: 1-WAY GRADIENT TRANSFER (FEDAVG (McMahan et al., 2017) with added steps)
Input: Initial model x^(0); CLIENTOPTIMIZER, SERVEROPTIMIZER with respective learning rates η, η_s; central loss function f_c (3)
for t ∈ {0, 1, ..., T − 1} do
    Sample batch B_c^(t); compute stochastic gradient g_c(x^(t); B_c^(t)) of f_c(x^(t)); set augmenting gradient g_c^(t) = g_c(x^(t); B_c^(t))
    Sample a subset S^(t) of clients
    for client i ∈ S^(t) in parallel do
        Δ_i^(t), p_i = CLIENTUPDATE(x^(t), g_c^(t), CLIENTOPTIMIZER, η)    (CLIENTUPDATE defined in Algorithm 1)
    Aggregate client changes Δ^(t) = Σ_{i∈S^(t)} p_i Δ_i^(t) / Σ_{i∈S^(t)} p_i
    Update global model x^(t+1) = SERVEROPTIMIZER(x^(t), −Δ^(t), η_s, t)
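A minimal NumPy sketch of a 1-WAY GRADIENT TRANSFER round, assuming plain SGD for both optimizers and equal client weights (names and the quadratic test losses are illustrative, not the paper's implementation):

```python
import numpy as np

def client_update(x0, client_grad, g_c, eta=0.1, K=5):
    """CLIENTUPDATE sketch: K local SGD steps, each augmented by the
    centralized gradient g_c broadcast at the start of the round."""
    x = x0.copy()
    for _ in range(K):
        x = x - eta * (client_grad(x) + g_c)
    return x - x0  # model delta Δ_i returned to the server

def one_way_gt_round(x, client_grads, central_grad, eta=0.1, eta_s=1.0, K=5):
    """One round of 1-WAY GRADIENT TRANSFER: a single fresh central
    stochastic gradient is computed at the server, sent down, and summed
    into every local step on every client."""
    g_c = central_grad(x)
    deltas = [client_update(x, g, g_c, eta, K) for g in client_grads]
    delta = np.mean(deltas, axis=0)      # equal client weights assumed
    return x + eta_s * delta             # SERVEROPTIMIZER(x, -Δ, η_s)
```

Each client effectively takes steps on f_i + f_c, so the round drives the model toward the mixed minimizer even though raw centralized data never leaves the server and raw client data never leaves the clients.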

3. RELATED WORK

There are parallels between GRADIENT TRANSFER and algorithms addressing inter-client data heterogeneity in FL, such as SCAFFOLD (Karimireddy et al., 2020b) or Mime (Karimireddy et al., 2020a). Those algorithms calculate a gradient reflective of the entire federated population and transmit it to clients to reduce update variance, improving optimization on non-IID client data. GRADIENT TRANSFER calculates a gradient reflective of centralized data/loss to augment client computations on decentralized data/loss (and, in 2-W GT, also the converse). However, SCAFFOLD requires keeping state at the server (control variates) for each participating client, which is impractical in real large-scale FL systems (Kairouz et al., 2019). 2-W GT only requires state (augmenting gradients) for two entities, and so is easily implemented.

Another instance of a non-IID client approach (Zhao et al., 2018) influencing mixed FL is the EXAMPLE TRANSFER algorithm (Anonymous, a), in which centralized examples are sent to federated clients instead of gradients. This is typically precluded in real FL applications: the data volume to transfer may be excessive, and in general EXAMPLE TRANSFER does not offer solutions for one of the main motivations of this work, reducing client computation and communication.

'Transfer learning' also involves two different training datasets, but has a different purpose. Transfer learning pretrains a model on one distribution (e.g., centralized data in a datacenter), then fine-tunes it on the actual distribution of interest (e.g., decentralized data via FL). It is desirable as a way to quickly train on the latter distribution (e.g., as in Ro et al. (2022)). But its sequential approach results in catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; French, 1999); accuracy on pretraining data is lost as the model learns to fit fine-tuning data instead. In mixed FL, we desire good inference performance on all data distributions trained on.
See Appendix B.4.5 for more. In differentially private (DP) optimization, a line of work has aimed to improve privacy/utility tradeoffs by utilizing additional non-private data. One way is to use non-private data for pretraining (Abadi et al., 2016). Another avenue is to use non-private data to learn the gradient geometry (Zhou et al., 2020; Amid et al., 2021; Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022), improving accuracy by enabling tighter, non-isotropic gradient noise during DP optimization. Amid et al. (2021) and Li et al. (2022) consider the FL use case. As in transfer learning, the additional data is used only to improve performance on a single distribution, and retaining accuracy on other distributions is a non-goal (in contrast to mixed FL). Also, the non-private data used generally matches (in distribution) the private data, whereas in mixed FL we typically explicitly leverage distinct distributions. A few recent works (Yu et al., 2020; Elbir et al., 2021; Jeong et al., 2021) present instances of mixed FL problems; they shift computations to the server that are intensive or impossible at the clients. These works do not address distribution shift or present more general mixed FL algorithms.

4. EXPERIMENTS

We now present experiments on three tasks (Table 1), showing a range of problems for which mixed FL is useful. They also show the comparative performance of each algorithm described in Section 2. In the first task, smile classification, federated clients retain only smiling face examples; a remedy for this severe label imbalance is to apply mixed FL, utilizing an additional datacenter dataset of unsmiling faces to train a capable classifier. To experiment, CelebA data (Liu et al., 2015; Caldas et al., 2018) is split into a federated 'smiling' dataset and a centralized 'unsmiling' dataset. Figures 1a and 2a show the AUC and loss convergence of our three algorithms when applied to this problem. Empirically, we observe GRADIENT TRANSFER converge faster than PARALLEL TRAINING. Section 5 will provide a theoretical explanation for this behavior.

4.2. MITIGATE BIAS IN TRAIN DATA (LANGUAGE MODEL; STACK OVERFLOW, WIKIPEDIA)

Consider the problem of learning a language model, such as an RNN-based next-character prediction model, used to make typing suggestions to a user in a mobile keyboard application. Because the ultimate inference application is on mobile phones, it is natural to train this model via FL, leveraging cached SMS text content highly reflective of inference-time usage (at least for some users). However, the mobile phones participating in the federated learning of the model might be only a subset of the mobile phones on which we desire to deploy for inference. Higher-end mobile phones can disproportionately participate in FL, as their larger memory and faster processors allow them to complete client training faster. But to do well at inference, a model should make accurate predictions for users of lower-end phones as well. A purely FL approach can do an inadequate job of learning these users' usage patterns. (See Kairouz et al. (2019) for more on aspects of fairness and bias in FL.) Mixed FL overcomes this problem by training a model jointly on federated data (representative of users of higher-end phones) and a datacenter dataset (representative of users of lower-end phones). We simulate this scenario using two large public datasets: the Stack Overflow dataset (Kaggle) as federated data, and the Wikipedia dataset (Wikimedia Foundation) as datacenter data. Figure 1b shows results. The 'only FL' scenario learns Stack Overflow (but not Wikipedia) patterns of character usage, and so has reduced accuracy (∼0.60) when evaluated on examples from both datasets. The mixed FL algorithms demonstrate learning both: they all achieve an evaluation accuracy (∼0.66) comparable to an imagined 'oracle' that could centrally train on the combination of datasets, and similar loss to this more expensive baseline scenario (see Table 2). Unlike the smile classification experiment, here PARALLEL TRAINING converges about as fast as GRADIENT TRANSFER.
Section 5 will provide a theoretical explanation for this behavior.

4.3. REGULARIZE EMBEDDINGS AT SERVER (MOVIE RECOMMENDATION; MOVIELENS)

The third task we study, movie recommendation, is one with an embedding regularization term as described in Section 1. As Table 1 shows, a key difference from the previous two mixed FL experimental scenarios is that here we are mixing losses with different functional forms (instead of mixing different datacenter and federated data distributions). We study this scenario by training a dual encoder representation learning model (Covington et al., 2016) for next-movie prediction on the MovieLens dataset (Harper & Konstan, 2015; GroupLens). As described in Section 1, limited negative examples can degrade representation learning performance. Previous work (Anonymous, b) proposed using losses insensitive to local client negatives to improve federated model performance. They observed significantly improved performance by using a two-part loss: (1) a hinge loss to pull embeddings for similar movies together, and (2) a spreadout regularization (Zhang et al., 2017) to push embeddings for unrelated movies apart. For clients to calculate (2), the server must communicate all movie embeddings to each client, and clients must perform a matrix multiplication over the entire embedding table. This introduces enormous computation and communication overhead when the number of movies is large. Mixed FL can alleviate this communication and computation overhead. Instead of computing both loss terms on clients, clients calculate only the hinge loss and the server calculates the expensive regularization, avoiding costly computation on each client. Also, computing the hinge loss only requires the embeddings of movies in a client's local dataset. Federated select (Charles et al., 2022) enables only those embeddings to be sent to that client, saving communication and on-client memory. Experiments show all mixed FL algorithms achieve model performance (around 0.1 for recall@10) comparable to the baseline scenario where everything is computed on the clients.

Moreover, mixed FL eliminates more than 99.9% of client computation and more than 93.9% of communication (see Table 2). For computation and communication analysis, see Appendix B. Note that real-world models can be much larger than this movie recommendation model. Without mixed FL, communicating such large models to clients and computing regularization would be impractical in large-scale settings. PARALLEL TRAINING converges slightly slower than either GRADIENT TRANSFER algorithm (Figure 3) but reaches the same evaluation loss at around 1500 rounds.
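A spreadout-style regularizer of the kind moved to the server can be sketched as follows. This is an illustrative NumPy version (penalizing squared off-diagonal inner products of the embedding table), not the paper's exact regularizer; the normalization is a choice made here for the example.

```python
import numpy as np

def spreadout_loss(E):
    """Spreadout-style regularization over an embedding table E
    (num_items x dim): penalize squared pairwise inner products between
    distinct items so unrelated embeddings repel. In mixed FL this
    O(num_items^2 * dim) computation happens only at the server."""
    n = E.shape[0]
    G = E @ E.T                        # all pairwise inner products
    off = G - np.diag(np.diag(G))      # drop self-similarities
    return np.sum(off ** 2) / (n * (n - 1))

def spreadout_grad(E):
    """Analytic gradient of spreadout_loss w.r.t. E: each pair (i, j)
    contributes twice, giving the factor of 4."""
    n = E.shape[0]
    G = E @ E.T
    off = G - np.diag(np.diag(G))
    return 4.0 * (off @ E) / (n * (n - 1))
```

The server can take gradient steps on this term alone while clients handle only the affinity (hinge) loss over their local movies, which is the division of labor described above.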

5.1. PRELIMINARIES

We now describe convergence properties for each mixed FL algorithm from Section 2. We assume the mixed loss f has a finite minimizer (i.e., ∃ x* s.t. f(x) ≥ f(x*) ∀ x). We assume the client losses f_i and centralized loss f_c are β-smooth; if the f_i are β-smooth, the federated loss f_f is also. For some results, we assume f_i and f_c are μ-convex (possibly strongly convex, μ > 0); if the f_i are μ-convex, f_f is also. For a parameter vector x, we use ∇f_i(x) to denote the full gradient of f_i (i.e., over all data on client i). Similarly, ∇f_f(x) and ∇f_c(x) denote the full gradients of f_f and f_c at x. We use g_i(x) to denote an unbiased stochastic gradient of f_i, calculated on a random batch B_i of examples on client i.

We focus on the impact to convergence when differences exist between the federated and centralized losses/data. As such, we make the following homogeneity assumption about the federated data, which simplifies the analysis and brings out the key differences. Our analysis can be easily extended to heterogeneous clients by assuming a bound on the variance of the client gradients.

Assumption 5.1. The federated clients have homogeneous data distributions (i.e., with examples drawn IID from a common data distribution), and their stochastic gradients have bounded variance. Specifically, for some σ > 0, we have for all clients i and parameter vectors x:

E[g_i(x)] = ∇f_f(x),    E‖g_i(x) − ∇f_f(x)‖² ≤ σ²    (4)

Under such conditions, FEDAVG convergence can match that of SGD. Let g_f(x) denote the cohort-averaged stochastic gradient over the sampled clients S:

g_f(x) = (1/S) Σ_{i∈S} g_i(x),    E‖g_f(x) − ∇f_f(x)‖² = E‖(1/S) Σ_{i∈S} g_i(x) − ∇f_f(x)‖² ≤ σ²/S    (5)

Let g_c(x) denote a stochastic gradient of the centralized loss f_c at x, calculated on a randomly sampled batch B_c of centralized examples (from a datacenter dataset), with variance bounded by σ_c²:

E‖g_c(x) − ∇f_c(x)‖² ≤ σ_c²    (6)
Summarizing Equations 4-6: a client's stochastic gradient g_i(x) has variance bounded by σ², the federated cohort's stochastic gradient g_f(x) has variance bounded by σ²/S, and the centralized stochastic gradient g_c(x) has variance bounded by σ_c². Increasing client batch size |B_i| reduces the variance of g_i(x) and g_f(x), increasing cohort size S reduces the variance of g_f(x), and increasing central batch size |B_c| reduces the variance of g_c(x).
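The 1/S variance reduction in Equation 5 is easy to check numerically. The simulation below (with illustrative values for the true gradient, σ, and S) draws per-client stochastic gradients as the true gradient plus Gaussian noise and compares per-client variance to the cohort-averaged variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, sigma, S, trials = 3.0, 2.0, 16, 20000  # illustrative values

# Per-client stochastic gradients: unbiased with variance sigma^2,
# as in Assumption 5.1.
g_i = true_grad + sigma * rng.standard_normal((trials, S))

# Cohort-averaged gradient g_f for each simulated round.
g_f = g_i.mean(axis=1)

var_i = g_i.var()   # empirically close to sigma^2
var_f = g_f.var()   # empirically close to sigma^2 / S
```

This mirrors the summary above: growing the cohort S shrinks the variance of g_f without touching any single client's batch size.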

Note that σ²/S only bounds variance within the federated data distribution, and σ_c² only bounds variance within the central data distribution. To say something about variance across the two data distributions, we adapt the notion of 'bounded gradient dissimilarity' ('BGD') introduced in Karimireddy et al. (2020b) (Definition A1) and apply it to the mixed FL scenario.

Definition 5.2 (mixed FL (G, B)-BGD). There exist constants G ≥ 0 and B ≥ 1 such that ∀x:

w_f ‖∇f_f(x)/w_f‖² + w_c ‖∇f_c(x)/w_c‖² ≤ G² + B² ‖∇f(x)‖²

In the definition, w_f and w_c are proportions of influence (w_f + w_c = 1) of the federated and centralized objectives on the overall mixed optimization. (The simplest setting is w_f = w_c = 1/2.)

5.2. BOUNDS

We can now state upper bounds on convergence (to an error in mixed loss smaller than ε) for the respective mixed FL algorithms. For ease of comparison, the convergence bounds are summarized in Table 3. The theorems and proofs of these convergence bounds are given in Appendix C.

Table 3: Order of the number of rounds required to be within ε of the optimal mixed loss, for different mixing strategies. See Appendix C. σ² as in (4), σ_c² as in (6), G and B as in Def. 5.2 with w_f = w_c = 1/2. β is the smoothness bound (Def. D.1), μ is the convexity bound (Def. D.4). K is the number of client local steps taken (≥ 2), S is the client cohort size (per round). D and F are initial distances/loss errors, described in Appendix C.

μ-CONVEX:
    PARALLEL TRAINING:  (σ² + Sσ_c²)/(KSμε) + G√β/(μ√ε) + (B²β/μ) log(1/ε)
    1-W GT:             (σ² + KSσ_c²)/(KSμε) + (β/μ) log(1/ε)
    2-W GT:             (σ² + Sσ_c²)/(KSμε) + (β/μ) log(1/ε)
CONVEX:
    PARALLEL TRAINING:  (σ² + Sσ_c²)D²/(KSε²) + G√β D²/ε^(3/2) + B²βD²/ε
    1-W GT:             (σ² + KSσ_c²)D²/(KSε²) + βD²/ε
    2-W GT:             (σ² + Sσ_c²)D²/(KSε²) + βD²/ε
NONCONVEX:
    PARALLEL TRAINING:  (σ² + Sσ_c²)βF/(KSε²) + G√β F/ε^(3/2) + B²βF/ε
    1-W GT:             (σ² + KSσ_c²)βF/(KSε²) + βF/ε
    2-W GT:             (σ² + Sσ_c²)βF/(KSε²) + βF/ε

As mentioned previously, the analysis extends in a straightforward manner to the setting of heterogeneous clients, assuming a bound on the variance of client gradients: for all x, (1/N) Σ_{i=1}^{N} ‖∇f_i(x) − ∇f_f(x)‖² ≤ σ_f² for some σ_f ≥ 0. Under this assumption, the bounds in Table 3 change by an additional Kσ_f² term inside the parenthesized expression involving σ² and σ_c² in the numerator of the leading term. The derivation of these more general bounds follows along the same lines, so we omit it for brevity. Analyzing Table 3, there are several implications to be drawn.

Significant (G,B)-BGD impedes PARALLEL TRAINING

The convergence bounds for PARALLEL TRAINING show a dependence on the G and B parameters from Definition 5.2. If a mixed FL problem involves a large amount of dissimilarity between the federated and centralized gradients (i.e., if G ≫ 0 or B ≫ 1), then PARALLEL TRAINING will be slower to converge than the alternatives.

Significant σ_c² impedes 1-WAY GRADIENT TRANSFER  1-WAY GRADIENT TRANSFER is more sensitive to the central variance σ_c². Unlike the other algorithms, the impact of σ_c² on convergence scales with the number of local steps K; 1-WAY GRADIENT TRANSFER requires a central batch size |B_c| that is K times larger to achieve the same impact on convergence. Intuitively, this makes sense: in a round, PARALLEL TRAINING and 2-WAY GRADIENT TRANSFER sample K fresh central batches during centralized optimization, while 1-WAY GRADIENT TRANSFER samples only a single central batch.
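The fresh-batch intuition can be checked with a small simulation (illustrative parameter values): over K local steps, reusing one stochastic central gradient accumulates noise K·ξ with variance K²σ_c², while drawing K fresh batches accumulates variance only Kσ_c², a factor-of-K gap.

```python
import numpy as np

rng = np.random.default_rng(1)
K, sigma_c, trials = 10, 1.0, 20000  # illustrative values

# Noise in the centralized-gradient contribution accumulated over K steps:
# PT / 2-W GT draw a fresh central batch each step...
fresh = sigma_c * rng.standard_normal((trials, K)).sum(axis=1)
# ...while 1-W GT reuses one sampled gradient for all K steps.
reused = K * sigma_c * rng.standard_normal(trials)

# Reuse inflates accumulated variance by ~K (K^2 sigma_c^2 vs K sigma_c^2),
# which is why 1-W GT needs a central batch roughly K times larger.
ratio = reused.var() / fresh.var()
```

The measured ratio concentrates near K, matching the Kσ_c² term in 1-W GT's bounds versus the Sσ_c² terms for the other two algorithms.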

2-WAY GRADIENT TRANSFER should always converge at least as well as others

The convergence bound for 2-WAY GRADIENT TRANSFER is unaffected by gradient dissimilarity (i.e., by G ≫ 0 or B ≫ 1), unlike PARALLEL TRAINING. Also, the bound for 2-WAY GRADIENT TRANSFER is less sensitive to σ_c² than the bound for 1-WAY GRADIENT TRANSFER (as described above).

5.3. ANALYSIS OF EXPERIMENTS

PARALLEL TRAINING has a convergence bound substantially different from the GRADIENT TRANSFER algorithms; the dependence on the BGD parameters G and B indicates there are mixed FL problems where PARALLEL TRAINING is slower to converge than GRADIENT TRANSFER (in either form). How can we know whether a particular problem is one where PARALLEL TRAINING will converge more slowly? It would be useful to know G and B, but they cannot be measured exactly: G and B (Definition 5.2) are upper bounds holding ∀x, and the entire space of x cannot realistically be checked. Instead, we introduce sampled approximations to empirically estimate these upper bounds. Let x^(t) be the global model at the start of round t, and let ∇f_ft, ∇f_ct, ∇f_t be approximations of the federated, centralized, and total gradients at round t. Considering Definition 5.2, we define G_t as a sampled approximation of G assuming B = 1, and B_t as a sampled approximation of B assuming G = 0:

∇f_ft = (1/S) Σ_{i∈S} g_i(x^(t)),    ∇f_ct = g_c(x^(t)),    ∇f_t = ∇f_ft + ∇f_ct

G_t² = (1/w_f)‖∇f_ft‖² + (1/w_c)‖∇f_ct‖² − ‖∇f_t‖²,    B_t² = [(1/w_f)‖∇f_ft‖² + (1/w_c)‖∇f_ct‖²] / ‖∇f_t‖²    (7)
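The sampled approximations of Equation 7 are straightforward to compute from one round's gradients. A minimal NumPy sketch (function name and inputs are illustrative; in practice g_i and g_c come from the actual training round):

```python
import numpy as np

def bgd_estimates(client_grads, g_c, w_f=0.5, w_c=0.5):
    """Sampled approximations of the (G, B) bounded-gradient-dissimilarity
    constants, per Equation 7: G_t^2 assumes B = 1, B_t^2 assumes G = 0.

    client_grads: list of per-client stochastic gradients at x^(t).
    g_c:          centralized stochastic gradient at x^(t).
    """
    g_f = np.mean(client_grads, axis=0)   # cohort gradient estimate ∇f_ft
    g_t = g_f + g_c                       # mixed gradient estimate ∇f_t
    lhs = np.sum(g_f ** 2) / w_f + np.sum(g_c ** 2) / w_c
    G2_t = lhs - np.sum(g_t ** 2)
    B2_t = lhs / np.sum(g_t ** 2)
    return G2_t, B2_t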

6. CONCLUSION

This paper has introduced mixed FL, including motivation, algorithms and their convergence properties, and intuition for when a given algorithm will be useful for a given problem. Our experiments indicate mixed FL can improve accuracy and reduce communication and computation across tasks.

This work focused on jointly learning from a single decentralized client population and a centralized entity, as it illuminates the key aspects of the mixed FL problem. Note that mixed FL and the associated properties we define in this paper (like mixed FL (G, B)-BGD) are easily expanded to work with multiple (> 1) distinct client populations participating: e.g., a population of mobile phones and a separate population of smart speakers, or mobile phones separated into populations with distinct capabilities/usage (high-end vs. low-end, or by country/language). Also, there need not be a centralized entity; mixing can be solely between distinct federated datasets. It is interesting to reflect on the bounds of Table 3 and what they indicate about the benefits of separating a single decentralized client population into multiple populations for mixed FL purposes. The bounds are in terms of σ² (representing within-population 'variability') and G and B (representing cross-population 'variability'). Splitting a population based on traits will likely decrease σ² (each population is now more homogeneous) but introduce or increase G and B (populations are now distinctive). This might indicate scenarios where GRADIENT TRANSFER methods (only bounded by σ²) become more useful and PARALLEL TRAINING (also bounded by G and B) becomes less useful.

The limits of our convergence bounds should be noted. First, they are 'loose'; practical performance in particular algorithmic scenarios could be better, and thus comparisons between algorithms could differ. Second, our bounds assume IID federated data, which is invalid in practice; convergence properties differ on non-IID data. While our analysis, extended to handle non-IID data, shows that the bounds do not materially change, it is still a place where theory and practice slightly diverge. It is important to note that mixed FL is orthogonal to the choice of algorithm for federated training.

In principle, mixed FL techniques are expected to have positive societal impacts insofar as they further develop the toolkit for FL (which has security and privacy benefits to users) and improve accuracy on final inference distributions. Also, we have shown (Section 4.2) how mixed FL can address participation biases that arise in FL. However, the addition of server-based data to federated optimization raises the possibility that biases in large public corpora find their way into more applications of FL.

A PRACTICAL IMPLEMENTATION DETAILS

A.1 DOWNLOAD SIZE

GRADIENT TRANSFER (either 1-WAY or 2-WAY) requires sending additional data as part of the communication from server to clients at the start of a federated round. Apart from the usual model checkpoint weights, with GRADIENT TRANSFER we must also transmit 'augmenting' gradients of the model weights w.r.t. centralized data. Naively, this doubles the download size, as the gradient is the same size as the model. However, the augmenting gradients should be amenable to compression, e.g., using an approach such as Mitchell et al. (2022).

A.2 UPLOAD SIZE

With PARALLEL TRAINING and 1-WAY GRADIENT TRANSFER, no client gradient information is used outside of the clients themselves, so there is no additional information (apart from the model deltas and aggregation weights) to upload to the server. With 2-WAY GRADIENT TRANSFER, client gradient information is used in centralized training, so it needs to be conveyed to the server somehow. When the FL client optimizer is SGD, the average client gradient in a round (over all participating clients, over all steps) can be determined from the model deltas and aggregation weights that are already being sent back to the server, meaning no additional upload bandwidth is necessary. Apart from bandwidth considerations, this also means there is no additional vector for private data leakage.

The derivation is as follows. Each client i transmits to the server a local model change Δ_i and an aggregation weight p_i that is related to the number of steps K_i taken. The average total gradient applied at client i during round t is:

ḡ^(t) = −(1/(ηK_i)) Δ_i^(t)

The average client gradient (i.e., w.r.t. just client data) at client i is:

ḡ_i^(t) = −(1/(ηK_i)) Δ_i^(t) − g_c^(t)

where g_c^(t) is the augmenting centralized gradient that was calculated from centralized data and used in round t. The average (across the cohort) of average client gradients, weighted by K_i, is:

ḡ_f^(t) = −(1/(η Σ_i K_i)) Σ_i Δ_i^(t) − g_c^(t)

This average client gradient ḡ_f^(t) is in the spirit of SCAFFOLD (Karimireddy et al., 2020b), Equation 4, Option II. It will be used as the augmenting federated gradient g_f^(t+1) in the subsequent round t + 1, to augment centralized optimization. See Algorithm 1.
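The recovery formula above can be sketched and checked numerically. This is a minimal simulation assuming SGD clients (function names are illustrative):

```python
import numpy as np

def simulate_client(x0, grads, g_c, eta):
    """SGD client: applies a fixed sequence of client gradients, each
    augmented by the broadcast centralized gradient g_c; returns its delta."""
    x = x0.copy()
    for g in grads:
        x = x - eta * (g + g_c)
    return x - x0

def recover_avg_client_gradient(deltas, Ks, g_c, eta):
    """Recover the K_i-weighted average of client-only gradients from the
    uploaded deltas alone, per the A.2 derivation: no extra upload needed."""
    total_steps = sum(Ks)
    g_bar_total = -sum(deltas) / (eta * total_steps)  # avg applied gradient
    return g_bar_total - g_c   # subtract the shared centralized component
```

With two clients taking 2 steps of gradient 1 and 3 steps of gradient 3, the recovered value is the step-weighted mean (2·1 + 3·3)/5 = 2.2, confirming the bookkeeping.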

A.3 DEBUGGING AND HYPERPARAMETER INTUITION VIA K = 1

As these algorithms each involve different hyperparameters, validating that software implementations are behaving as expected is non-trivial. Something that proved useful for debugging purposes, and that also provided practical experience in understanding equivalences between the algorithms, was to run test cases with the number of local steps K set to 1. In this setting, the three mixed FL algorithms are effectively identical and should make equivalent progress during training. Note that the convergence bounds of Table 3 hold for K ≥ 2, so this takes us outside the operating regime where the bounds predict performance. It also takes us outside the operating regime that is typically useful (FL use cases generally find multiple steps per round to be beneficial). But it does serve a purpose when debugging.

Here we provide some additional information. Figure 4 plots these sampled approximation metrics over the first 100 rounds of training. We ran 5 simulations per experiment and took the maximum at each round across simulations. We used the same hyperparameters as described below (in Subsection B.2), except taking only a single step per round (K = 1).
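The K = 1 equivalence can be verified directly in a toy setting. The sketch below assumes plain SGD everywhere with the learning-rate conventions of Subsection B.2 (η_s = η_m = 1, η_c = η); under those assumptions, one PARALLEL TRAINING round and one 1-WAY GRADIENT TRANSFER round reduce to the same update x − η(g_f + g_c). Names are illustrative.

```python
import numpy as np

def pt_round_k1(x, g_f, g_c, eta):
    """PARALLEL TRAINING round with K = Kc = 1, SGD everywhere,
    eta_s = eta_m = 1 and eta_c = eta (Subsection B.2 conventions)."""
    delta_c = -eta * g_c(x)         # single central step
    delta_f = -eta * g_f(x)         # single client step, aggregated
    return x + delta_c + delta_f    # merge of the two branch deltas

def one_way_gt_round_k1(x, g_f, g_c, eta):
    """1-WAY GRADIENT TRANSFER round with K = 1: the broadcast central
    gradient is simply added to the single client step."""
    return x - eta * (g_f(x) + g_c(x))
```

A unit test asserting both functions agree on arbitrary inputs is exactly the kind of K = 1 debugging check described above.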

B.2 ADDITIONAL DETAILS FOR EXPERIMENTS IN SECTION 4

General Notes on Hyperparameter Selection For the various experiments in Section 4, we empirically determined good hyperparameter settings (as documented in Tables 5, 6, 7, 8, 9, and 10). Our general approach for each task was to leave the server learning rate η_s at 1, select a number of steps K that made the most use of the examples in each client's cache, and then sweep the client learning rate η to determine a setting that was fast but did not diverge. For PARALLEL TRAINING and 2-WAY GRADIENT TRANSFER, which involve central optimization and merging, we set the merging learning rate η_m to 1, and set the central learning rate η_c as the product of client and server learning rates: η_c = ηη_s (and since η_s = 1, client and central learning rates were equal).

General Notes on Comparing Algorithms

We generally kept hyperparameters equivalent when comparing the algorithms. For example, we aimed to set batch sizes for all algorithms such that the client and central gradient variances σ² and σ_c² have equivalent impact on convergence (meaning |B_c| = S|B_i| for PT and 2-W GT, and |B_c| = KS|B_i| for 1-W GT). In the case of language model training with 1-WAY GRADIENT TRANSFER, following this rubric would have meant a central batch size |B_c| of 12800; we halved this for practical computation reasons. For a given task, we also generally kept learning rates the same for all algorithms. Interestingly, we observed that as η (and η_c, if applicable) is increased for a given task, the 2-WAY GRADIENT TRANSFER algorithm is the first of the three to diverge, and so we had to adjust accordingly: e.g., in the language modeling experiment we used a lower η for 2-W GT than for PT and 1-W GT.
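The batch-size rubric above can be written as a small helper. This is an illustrative sketch, not the paper's code; the cohort size, step count, and client batch size in the example are hypothetical values chosen only so the 1-W GT case lands on the 12800 figure mentioned in the text.

```python
def central_batch_size(client_batch, cohort_size, local_steps, algorithm):
    """Central batch size |B_c| that balances central vs. client gradient
    variance, per the rubric in the text: |B_c| = S * |B_i| for PT and
    2-W GT, and |B_c| = K * S * |B_i| for 1-W GT."""
    if algorithm in ("PT", "2-W GT"):
        return cohort_size * client_batch
    if algorithm == "1-W GT":
        return local_steps * cohort_size * client_batch
    raise ValueError(f"unknown algorithm: {algorithm}")

# Hypothetical example: |B_i| = 8, S = 100, K = 16.
bc_1wgt = central_batch_size(8, 100, 16, "1-W GT")  # K * S * |B_i|
bc_pt = central_batch_size(8, 100, 16, "PT")        # S * |B_i|
```

The extra factor of K for 1-W GT is what makes its rubric impractical at large K, motivating the halving described above.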

B.2.1 CELEBA SMILE CLASSIFICATION

Datasets The CelebA federated dataset consists of 9,343 raw clients, which can be broken into train/evaluation splits of 8,408/935 clients, respectively (TFF CelebA documentation, 2022). The raw clients have an average cache size of ∼21 face images. The images are about equally split between smiling and unsmiling faces. In order to enlarge cache size, we group three raw clients together into one composite client, so our federated training data involves 2,802 clients with caches of (on average) ∼63 face images (and about half that when we limit the clients to only smiling faces). Our evaluation data consists of both smiling and unsmiling faces, and is meant to stand in for the inference distribution (where accurate classification of both smiling and unsmiling inputs is necessary). Note that as CelebA contains smiling and unsmiling faces in nearly equal amounts, a high evaluation accuracy cannot come at the expense of one particular label being poorly classified.

Model Architecture The architecture used is a very basic fully-connected neural network with a single hidden layer of 64 neurons with ReLU activations.

Hyperparameter Settings The settings used in mixed FL training are shown in Tables 5 and 6.

B.2.2 LANGUAGE MODELING

Model Architecture The architecture used is a recurrent neural network (RNN) with an embedding dimension of 256 and 1024 GRU units.

Hyperparameter Settings The settings used in mixed FL training are shown in Tables 7 and 8. Note that in this experiment we also found clipping the norm of gradients (both federated and centralized) to be beneficial, for all mixed FL algorithms.
We used adaptive clipping for the federated gradients (via the method described in …).

B.2.3 MOVIE RECOMMENDATION

Model Architecture The architecture used is the same as Anonymous (b): a dual encoder representation learning model with a bag-of-words encoder for the left tower (which takes in the list of movies a user has seen) and a simple embedding-lookup encoder for the right tower (which takes in the next movie the user sees).

Hyperparameter Settings The settings used in mixed FL training are shown in Tables 9 and 10. In Subsection 4.3, SGD was used for all optimizers: CLIENTOPTIMIZER and SERVEROPTIMIZER, and (if PT or 2-W GT) CENTRALOPTIMIZER and MERGEOPTIMIZER. See Appendix B.4.4 for an additional experiment where ADAM is used as the SERVEROPTIMIZER in 1-W GT.

Recall@10 As mentioned in Subsection 4.3, all mixed FL algorithms achieved similar global recall@10 compared to the baseline. Figure 6 shows evaluation recall@10 over 2000 training rounds.

Table 11: Per-round computation for the regularization term (second row) and communication overhead (last row), for the baseline and each mixed FL algorithm.

            Baseline          PT            1-W GT        2-W GT
Comp.     (K · N² · d)/2   (N² · d)/2    (N² · d)/2    (N² · d)/2
Comm.       2 · N · d       2 · n · d     3 · n · d     3 · n · d

B.3 COMPUTATION AND COMMUNICATION SAVINGS FOR MOVIE RECOMMENDATION

This section provides a detailed analysis of the computation and communication savings brought by mixed FL in the movie recommendation task. For movie recommendation, both the input feature and the label are movie IDs with a vocabulary size of N. They share the same embedding table, with an embedding dimension of d. The input feature and label embedding layers account for most of the model parameters in a dual encoder, so we use the total size of the feature and label embeddings to approximate the model size: M = (N + N) · d = 2N · d. The batch size is |B_i| and the number of local steps per round is K. Let n be the average number of movies in each client's local dataset in a training round; n is smaller than |B_i| · K.

Computation As shown in the second row of Table 11, the amount of computation for the regularization term is (K · N² · d)/2 when calculated on-device (baseline). When the regularization term is computed on the server (mixed FL), the complexity is (N² · d)/2, so the total computation saving with mixed FL is ((K − 1) · N² · d)/2. (We use (N² · d)/2 instead of N² · d for the regularization term computation, which is more accurate for an optimized implementation.) The total computation complexity of the forward pass is O(B_i d + B_i d² + B_i² d), where the three terms correspond to the bag-of-words encoder, the context hidden layer, and the similarity calculation. The hinge loss and spreadout computation is O(B_i) + O(0.5 N² d). The gradient computation is O(2 B_i d² + 2 B_i² d) for the network backward pass and O(B_i) + O(N d) for hinge and spreadout. Therefore, when computing the regularization term on the server with mixed FL, the computation saving for each client is
$$1 - \frac{B_i d + 3 B_i d^2 + 3 B_i^2 d + 2 B_i}{B_i d + 3 B_i d^2 + 3 B_i^2 d + 2 B_i + 0.5 N^2 d + N d},$$
which is 99.98% for all mixed FL algorithms.

Communication The communication overheads of each algorithm are presented in the last row of Table 11.
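Both rows of Table 11 can be turned into a small calculator. This is an illustrative sketch; the numeric values in the example (N, d, n, |B_i|) are hypothetical stand-ins, not the task's actual dimensions, so the example only checks that the savings are large rather than reproducing the exact percentages in the text.

```python
def regularization_compute_saving(N, d, B_i):
    """Fraction of per-client compute saved by moving the spreadout
    regularizer to the server, per the expression in the text."""
    on_device = B_i * d + 3 * B_i * d**2 + 3 * B_i**2 * d + 2 * B_i
    regularizer = 0.5 * N**2 * d + N * d
    return 1.0 - on_device / (on_device + regularizer)

def communication_per_round(N, n, d, algorithm):
    """Per-round communication cost, following the last row of Table 11
    (in embedding-parameter units)."""
    return {"baseline": 2 * N * d, "PT": 2 * n * d,
            "1-W GT": 3 * n * d, "2-W GT": 3 * n * d}[algorithm]

# Hypothetical sizes: vocabulary N = 3000, embedding dim d = 64,
# client batch |B_i| = 16, n = 60 movies per client per round.
saving = regularization_compute_saving(3000, 64, 16)
comm_reduction = 1 - (communication_per_round(3000, 60, 64, "1-W GT")
                      / communication_per_round(3000, 60, 64, "baseline"))
```

Because the regularizer cost grows as N² while the per-example network cost does not, the compute saving approaches 100% for any realistic vocabulary size.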
For the baseline, the server and each client need to communicate the full embedding table and the gradients, so the communication overhead is 2 · N · d, or 494KB. With PARALLEL TRAINING, the server and each client only communicate the movie embeddings and the gradients corresponding to movies in that client's local dataset, so the communication traffic is reduced to 2 · n · d, or 20KB. GRADIENT TRANSFER requires the server to send both the movie embeddings and gradients to each client, so the communication overhead becomes 3 · n · d, or 30KB. Overall, mixed FL saves more than 93.9% of the communication overhead relative to the baseline.

B.4 ADDITIONAL EXPERIMENTS

B.4.1 SENSITIVITY TO CENTRAL VARIANCE

Table 3 shows that the theoretical bounds on rounds to convergence are directly proportional to the client variance bound σ² and the central variance bound σ_c². Also, as discussed in Subsection 5.2, 1-WAY GRADIENT TRANSFER is more sensitive to high central variance than the other two algorithms: whereas in the other algorithms the impact of σ_c² on convergence scales with cohort size S, in 1-WAY GRADIENT TRANSFER it scales with cohort size S and steps taken per round K. To observe the effect of σ_c² in practice, and to compare its effect on 1-WAY GRADIENT TRANSFER vs. 2-WAY GRADIENT TRANSFER, we ran sweeps of CelebA smile classification training, varying the central batch size |B_c|. The plots of evaluation loss and evaluation AUC of ROC are shown in Figures 7 (1-W GT) and 8 (2-W GT). For each central batch size setting, we ran 10 trials; the plots show the means of each setting's trials, with corresponding 95% confidence bounds. Figure 7 confirms the sensitivity of 1-WAY GRADIENT TRANSFER to central variance, with experiments using larger central batches B_c converging faster than experiments using smaller central batches. However, at least for this task, the benefits of lower variance disappear quickly: the convergence of AUC of ROC did not appreciably improve for central batch sizes larger than 25.
Presumably there is little effect at these larger central batch sizes because convergence is then dominated by client variance (i.e., further convergence improvements would have to come from increasing the client batch size |B_i|). Comparing Figure 8 with Figure 7, we empirically observe that 2-WAY GRADIENT TRANSFER is less sensitive than 1-WAY GRADIENT TRANSFER to central batch size (central variance).

B.4.2 TRADING η FOR K

The convergence bounds of Table 3 have an additional implication regarding the trade-off between the client learning rate η (and central learning rate η_c) and the number of local steps K. It is better to reduce η and η_c and increase K, but there are limits. The convergence bounds do not depend on the client or central learning rate (η or η_c), but they are inversely related to the local steps K. In general, it is best to take as many steps as possible, reducing the learning rates accordingly if necessary. But there are limits to how large K can be. First, clients have finite caches of data, so K will always be limited by cache size divided by batch size. Second, in the case of 1-WAY GRADIENT TRANSFER, any increase in K means that the central variance σ_c² must be proportionally reduced (as mentioned above), necessitating even larger central batch sizes (which at some point becomes infeasible). We observed this relationship empirically by running smile classification (Figure 9) and language modeling (Figure 10) experiments in which the client learning rate η (and central learning rate η_c) is varied inversely with K. For each hyperparameter configuration we ran 5 trials; the figures include 95% confidence intervals. The results confirm that reducing these learning rates, with a corresponding increase in the number of steps, is beneficial: it never hurts convergence, and often helps.
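The sweep described above can be sketched with two small helpers. These are illustrative, not the paper's code; the helper names and the numbers in the example are assumptions (the cache/batch figures loosely echo the CelebA setup but are not taken from the tables).

```python
def rescale_for_more_steps(eta, K, factor):
    """Take `factor` times more local steps while holding the effective
    step size eta_tilde = eta * eta_s * K fixed (eta_s held at 1), as in
    the sweeps of Figures 9 and 10."""
    return eta / factor, K * factor

def max_local_steps(cache_size, batch_size):
    # Clients have finite caches, so K is capped by cache size / batch size.
    return cache_size // batch_size

# Example: quadruple the steps, quarter the learning rate.
eta_new, K_new = rescale_for_more_steps(0.4, 4, 4)
# Example cap: a ~63-example cache with batch size 10 allows at most K = 6.
K_cap = max_local_steps(63, 10)
```

For 1-W GT the sweep also has to grow |B_c| in proportion to K (to hold the σ_c² impact fixed), which is the second limit described above.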

B.4.3 DIFFERENCES IN EFFECTIVE STEP SIZE

Table 12 in Appendix C shows that, in order to yield the convergence bounds stated in this paper, each algorithm makes a different assumption on the maximum effective step size. From this we draw one final implication for comparing the mixed FL algorithms. For a given η, the maximum K varies by algorithm; or, for a given K, the maximum η varies by algorithm. Consider just the effective federated step size $\tilde\eta = \eta\eta_s K$ for the moment, and assume the server learning rate η_s is held constant. Then each mixed FL algorithm has a different theoretical upper bound on the product of client learning rate η and local steps per round K. If using a common η, the theoretical upper limit on K varies by mixed FL algorithm; alternatively, if using a common K, the theoretical upper limit on η varies by mixed FL algorithm. The maximum effective step sizes of Table 12 imply that 2-WAY GRADIENT TRANSFER has narrower limits than 1-WAY GRADIENT TRANSFER on the allowable ranges of η and K. They also indicate that for PARALLEL TRAINING the allowable range of η, η_c, and K depends on the B parameter from mixed FL (G, B)-BGD (Definition 5.2). Some of this behavior was observed empirically when tuning hyperparameters for our experiments (discussed in Subsection B.2). For example, in the language modeling experiment with a constant number of steps K = 16, 2-WAY GRADIENT TRANSFER tends to diverge when the learning rate η is increased beyond 1.0, whereas 1-WAY GRADIENT TRANSFER is observed to converge even with η of 5.0. (PARALLEL TRAINING is in between; it still converges with η of 3.0, but diverges when η is 5.0.) An interesting characteristic to note is that using different η in different algorithms does not really impact comparative convergence. Figure 11 shows convergence in the language modeling experiment when 2-WAY GRADIENT TRANSFER uses η = 1.0 and 1-WAY GRADIENT TRANSFER and PARALLEL TRAINING both use η = 3.0 (in all cases, with K = 16).
The higher learning rate of 1-W GT and PT helps a little early on, but does not impact the number of rounds to convergence. This is consistent with the theoretical convergence bounds of Table 3, which depend on the number of steps K but not on the learning rates (as also discussed above).
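The per-algorithm limits can be made concrete with a helper that inverts the non-convex row of Table 12. This is an illustrative sketch: β and B are problem-dependent constants that are not known in practice, and the values passed below are assumptions chosen only to exhibit the ordering of the caps.

```python
def max_client_lr(algorithm, K, beta, B=1.0, eta_s=1.0):
    """Client learning-rate cap implied by the maximum effective step size
    eta_tilde = eta * eta_s * K, using the non-convex row of Table 12."""
    caps = {
        "PT": 1.0 / (6 * (1 + B**2) * beta),
        "1-W GT": 1.0 / (18 * beta),
        "2-W GT": 1.0 / (24 * beta),
    }
    return caps[algorithm] / (eta_s * K)

# With K = 16 and an assumed beta = 1.0, the cap for 2-W GT is the
# tightest of the three gradient-transfer rows, matching the empirical
# observation that 2-W GT is the first algorithm to diverge as eta grows.
lr_1w = max_client_lr("1-W GT", 16, 1.0)
lr_2w = max_client_lr("2-W GT", 16, 1.0)
```

The PT cap additionally shrinks as B grows, mirroring the (G, B)-BGD dependence noted above.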

B.4.4 1-W GT WITH ADAPTIVE OPTIMIZATION

We briefly studied the performance of 1-WAY GRADIENT TRANSFER when using ADAM in place of SGD as the server optimizer, i.e., FEDADAM (Reddi et al., 2020). Note that an adaptive server optimizer requires a smaller learning rate to perform well. Figure 12 reports the results of using ADAM as the server optimizer with a server learning rate of 0.01; all other hyperparameters are the same as in Tables 9 and 10. We observe that (1) ADAM works better than SGD, leading to better convergence, and (2) 1-WAY GRADIENT TRANSFER performs almost the same as the baseline when using ADAM. We will extend our investigation of mixed FL with adaptive optimization in the future. This will include studying methods for applying adaptive optimization to PARALLEL TRAINING and 2-WAY GRADIENT TRANSFER; these algorithms are more complicated since they involve additional optimizers (CENTRALOPTIMIZER and MERGEOPTIMIZER).
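A server-side adaptive optimizer in this style treats the averaged client delta as a pseudo-gradient. The sketch below is a minimal FEDADAM-like server step under assumed details (bias correction, default moment decay rates, epsilon); it is not the paper's exact configuration nor a faithful reproduction of Reddi et al.'s update.

```python
import numpy as np

class ServerAdam:
    """Minimal server-side ADAM step for FedAvg-style aggregation:
    the averaged client delta is treated as a (negated) pseudo-gradient.
    Hyperparameter values here are illustrative assumptions."""

    def __init__(self, lr=0.01, b1=0.9, b2=0.999, eps=1e-7):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = None
        self.t = 0

    def step(self, x, avg_delta):
        g = -avg_delta  # pseudo-gradient is the negated average delta
        if self.m is None:
            self.m, self.v = np.zeros_like(x), np.zeros_like(x)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# A delta pointing in the positive direction moves the model positively.
x1 = ServerAdam().step(np.zeros(2), np.array([1.0, 1.0]))
```

Because the adaptive step normalizes by the second-moment estimate, a much smaller server learning rate (0.01 here, versus 1 for SGD) is appropriate, as noted above.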

B.4.5 COMPARISON WITH TRANSFER LEARNING

Section 3 discussed how transfer learning is ill-suited for mixed FL; here we show why. Recall the bias mitigation mixing task of Section 4.2. The end goal is a single trained language model that performs well for both high-end and low-end mobile phone users. We use the Stack Overflow and Wikipedia datasets as respective experimental stand-ins, as they model the problem well: they are in the same domain (Latin alphabet) and same language (English), but with distinct usage distributions. Transfer learning trains the language model sequentially, first 'pretraining' on one dataset and then switching to 'fine-tune' on the other. One could first centrally pretrain (with Wikipedia) and then switch to federated fine-tuning (with Stack Overflow), or do the converse. Figure 13 shows what happens; the switches take place at 250 rounds. We also show PARALLEL TRAINING as a point of comparison. Figure 13 shows that in transfer learning, catastrophic forgetting of the pretraining task occurs after switching to the fine-tuning task: the centralized pretraining case precipitously drops its evaluation accuracy on datacenter data (Wikipedia), and the federated pretraining case does the same on federated data (Stack Overflow). Figure 13a shows that PARALLEL TRAINING achieves superior evaluation accuracy on the stand-in for the desired inference distribution (i.e., the combined Stack Overflow and Wikipedia datasets). To simplify Figure 13 we do not show GRADIENT TRANSFER results, but they would match PARALLEL TRAINING (e.g., as shown in Figures 1b and 2b). Because the PARALLEL TRAINING and GRADIENT TRANSFER algorithms presented in this paper jointly train on both data distributions in parallel (as opposed to sequentially), they avoid catastrophic forgetting.

Table 12: Maximum effective federated step size, $\tilde\eta = \eta\eta_s K$, for the convergence bounds in Appendix C and Table 3.
When applicable (PT, 2-W GT), the effective centralized step size η_c K shares the same maximum (and we assume the merging learning rate η_m is 1). β is the smoothness bound (Def. D.1).

              PT              1-W GT     2-W GT
µ-CONVEX    1/(6(1+B²)β)    1/(8β)     min{1/(81β), 1/(15µ)}
CONVEX      1/(6(1+B²)β)    1/(8β)     1/(81β)
NONCONVEX   1/(6(1+B²)β)    1/(18β)    1/(24β)

Table 13: Assumptions on merging or server learning rates, for the convergence bounds in Appendix C and Table 3.

              PT         1-W GT       2-W GT
(ASSUMES)   η_m ≥ 1     η_s ≥ √S     η_m ≥ 1

C CONVERGENCE THEOREMS

The three subsections that follow state theorems for convergence (to an error smaller than ε) for the respective mixed FL algorithms. The convergence bounds are summarized in Table 3 in Section 5. Tables 12 and 13 convey supporting aspects of the convergence bounds: limits on the effective step size ($\tilde\eta = \eta\eta_s K$) and assumptions on the learning rates.

C.1 PARALLEL TRAINING

Given Assumption 5.1, one can view PARALLEL TRAINING as a 'meta-FEDAVG' involving two 'meta-clients'. One meta-client is the population of IID federated clients (collectively having loss f_f), and the other meta-client is the centralized data at the datacenter (having loss f_c). As such, we can take the convergence theorem for FEDAVG derived in Karimireddy et al. (2020b) (Section 3, Theorem I) and observe that it applies to the number of rounds T to reach convergence in the PARALLEL TRAINING scenario.

Theorem C.1. For PARALLEL TRAINING, where the federated data is IID (Assumption 5.1), for β-smooth functions f_f and f_c which satisfy Definition 5.2, the number of rounds T to reach an expected error smaller than ε is:

µ-Strongly convex: $T = \tilde{O}\left(\frac{\sigma^2 + S\sigma_c^2}{KS\mu\epsilon} + \frac{G\sqrt{\beta}}{\mu\sqrt{\epsilon}} + \frac{B^2\beta}{\mu}\log\frac{1}{\epsilon}\right)$

General convex: $T = O\left(\frac{(\sigma^2 + S\sigma_c^2)D^2}{KS\epsilon^2} + \frac{G\sqrt{\beta}}{\epsilon^{3/2}} + \frac{B^2\beta D^2}{\epsilon}\right)$

Non-convex: $T = O\left(\frac{(\sigma^2 + S\sigma_c^2)\beta F}{KS\epsilon^2} + \frac{G\sqrt{\beta}}{\epsilon^{3/2}} + \frac{B^2\beta F}{\epsilon}\right)$

where $F = f(x^{(0)}) - f(x^*)$ and $D^2 = \|x^{(0)} - x^*\|^2$. Conditions for the above: $\eta_m \ge 1$; $\eta_c,\ \eta\eta_s \le \frac{1}{6(1+B^2)\beta K \eta_m}$.

Proof. The analysis is exactly along the lines of the analysis in Karimireddy et al. (2020b), Appendix D.2, in the context of FEDAVG. Effectively, the analysis applies to the 'meta-FEDAVG' problem of PARALLEL TRAINING, with two 'meta-clients', one being the central loss/data (with stochastic gradients with variance σ_c²) and the other being the federated loss/data. The homogeneity of the clients and the averaging over the sampled clients effectively reduces the variance of the client stochastic gradients to σ²/S. The analysis follows in a straightforward manner by accounting for the variance in the appropriate places. We omit the details for brevity.
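The variance-reduction step invoked in the proof can be made explicit. Under Assumption 5.1 the sampled clients' stochastic gradients are independent and unbiased for $\nabla f_f$, so averaging over a cohort of size S divides the variance bound by S:

```latex
% Cohort averaging of S IID, unbiased client stochastic gradients:
\mathbb{E}\left\|\frac{1}{S}\sum_{i\in\mathcal{S}} g_i(x) - \nabla f_f(x)\right\|^2
  = \frac{1}{S^2}\sum_{i\in\mathcal{S}}
    \mathbb{E}\left\|g_i(x) - \nabla f_f(x)\right\|^2
  \le \frac{\sigma^2}{S}.
```

This is why the bounds pair σ² with a 1/S factor that σ_c² does not receive: the central gradient is computed once per round rather than averaged over a cohort, which is also the source of the (σ² + Sσ_c²) groupings in the theorem.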

D CONVERGENCE PROOFS FOR 1-WAY GRADIENT TRANSFER

We will prove the convergence rate of 1-WAY GRADIENT TRANSFER for 3 different cases: strongly convex, general convex, and non-convex. While these proofs are influenced by those for SCAFFOLD in Karimireddy et al. (2020b), we note some additional technical challenges we face with 1-WAY GRADIENT TRANSFER. First, as the name indicates, gradient information only flows one way (in SCAFFOLD, gradient information flows from and to all clients). Second, we only sample a batch of centralized data once per round (whereas SCAFFOLD can draw multiple batches during a round). The result, as described in Subsection 5.2, is that 1-WAY GRADIENT TRANSFER works out to be more sensitive to the central variance σ_c² and the number of steps K than 2-WAY GRADIENT TRANSFER. We first state a number of definitions and lemmas in Subsection D.1 that are needed in proving the convergence rate of 1-WAY GRADIENT TRANSFER, before proceeding to the actual proofs in Subsection D.2.

D.1 ADDITIONAL DEFINITIONS AND LEMMAS

Note that some of the lemmas below are restatements of lemmas given in Karimireddy et al. (2020b). We opt to restate them here (versus referencing the relevant lemma in Karimireddy et al. (2020b) each time), due to the volume of usage of the lemmas, to ease the burden on the reader. We first present the subset of definitions and lemmas which make no assumptions of convexity (Subsection D.1.1), followed by the subset that assume convexity (Subsection D.1.2).

D.1.1 GENERAL DEFINITIONS AND LEMMAS

Definition D.1 (β-Smoothness). A function h is β-smooth if it satisfies:
$$\|\nabla h(x) - \nabla h(y)\| \le \beta\|x - y\|, \quad \text{for any } x, y$$
This implies the following quadratic upper bound on h:
$$\langle\nabla h(x), y - x\rangle \ge -\big(h(x) - h(y)\big) - \frac{\beta}{2}\|x - y\|^2, \quad \text{for any } x, y$$

Lemma D.2 (Relaxed triangle inequality). Let $\{v_1, \ldots, v_\tau\}$ be τ vectors in $\mathbb{R}^d$. Then for any a > 0:
$$\|v_i + v_j\|^2 \le (1 + a)\|v_i\|^2 + \left(1 + \frac{1}{a}\right)\|v_j\|^2$$
Also:
$$\Big\|\sum_{i=1}^\tau v_i\Big\|^2 \le \tau\sum_{i=1}^\tau\|v_i\|^2$$
Proof. The first statement for any a > 0 follows from the identity:
$$\|v_i + v_j\|^2 = (1 + a)\|v_i\|^2 + \left(1 + \frac{1}{a}\right)\|v_j\|^2 - \Big\|\sqrt{a}\,v_i - \frac{1}{\sqrt{a}}v_j\Big\|^2$$
The second statement follows from the convexity of $v \mapsto \|v\|^2$ and Jensen's inequality:
$$\Big\|\frac{1}{\tau}\sum_{i=1}^\tau v_i\Big\|^2 \le \frac{1}{\tau}\sum_{i=1}^\tau\|v_i\|^2$$

Lemma D.3 (Separating mean and variance). Let $\{\Xi_1, \ldots, \Xi_\tau\}$ be τ random variables in $\mathbb{R}^d$ which are not necessarily independent. First suppose that their means are $\mathbb{E}[\Xi_i] = \xi_i$ and their variances are bounded as $\mathbb{E}\|\Xi_i - \xi_i\|^2 \le \sigma^2$. Then:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau\Xi_i\Big\|^2 \le \Big\|\sum_{i=1}^\tau\xi_i\Big\|^2 + \tau^2\sigma^2$$
Now instead suppose that their conditional means are $\mathbb{E}[\Xi_i \mid \Xi_{i-1}, \ldots, \Xi_1] = \xi_i$, i.e. the variables $\{\Xi_i - \xi_i\}$ form a martingale difference sequence, with the variances bounded as above. Then we can show the tighter bound:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau\Xi_i\Big\|^2 \le 2\,\mathbb{E}\Big\|\sum_{i=1}^\tau\xi_i\Big\|^2 + 2\tau\sigma^2$$
Proof.
For any random variable X, $\mathbb{E}\|X\|^2 = \mathbb{E}\|X - \mathbb{E}[X]\|^2 + \|\mathbb{E}[X]\|^2$, implying:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau\Xi_i\Big\|^2 = \Big\|\sum_{i=1}^\tau\xi_i\Big\|^2 + \mathbb{E}\Big\|\sum_{i=1}^\tau(\Xi_i - \xi_i)\Big\|^2$$
Expanding the last term of the above expression using the relaxed triangle inequality (Lemma D.2) proves the first claim:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau(\Xi_i - \xi_i)\Big\|^2 \le \tau\sum_{i=1}^\tau\mathbb{E}\|\Xi_i - \xi_i\|^2 \le \tau^2\sigma^2$$
For the second statement, $\xi_i$ is not deterministic and depends on $\Xi_{i-1}, \ldots, \Xi_1$. Hence we have to resort to the cruder relaxed triangle inequality to claim:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau\Xi_i\Big\|^2 \le 2\,\mathbb{E}\Big\|\sum_{i=1}^\tau\xi_i\Big\|^2 + 2\,\mathbb{E}\Big\|\sum_{i=1}^\tau(\Xi_i - \xi_i)\Big\|^2$$
We then use the tighter expansion of the second term:
$$\mathbb{E}\Big\|\sum_{i=1}^\tau(\Xi_i - \xi_i)\Big\|^2 = \sum_{i,j}\mathbb{E}\big[\langle\Xi_i - \xi_i, \Xi_j - \xi_j\rangle\big] = \sum_i\mathbb{E}\|\Xi_i - \xi_i\|^2 \le \tau\sigma^2$$
The cross terms in the above expression have zero mean since $\{\Xi_i - \xi_i\}$ form a martingale difference sequence.
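The relaxed triangle inequality can be verified numerically on fixed vectors. This is a purely illustrative check, not part of the proofs; the vectors are arbitrary.

```python
import numpy as np

# Deterministic check of the relaxed triangle inequality (Lemma D.2).
v = np.array([1.0, -2.0, 0.5])
w = np.array([-0.5, 1.0, 3.0])

def sq(x):
    return float(np.dot(x, x))

# First statement, for several values of a > 0.
for a in (0.5, 1.0, 4.0):
    assert sq(v + w) <= (1 + a) * sq(v) + (1 + 1 / a) * sq(w) + 1e-12

# Second statement: ||sum_i v_i||^2 <= tau * sum_i ||v_i||^2.
vs = [v, w, v - w, 2 * w]
assert sq(sum(vs)) <= len(vs) * sum(sq(u) for u in vs) + 1e-12
```

Small checks like this are a cheap guard against sign or constant errors when the lemmas are transcribed into analysis code.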

D.1.2 DEFINITIONS AND LEMMAS ASSUMING CONVEXITY

Definition D.4 (µ-Convexity). A function h is µ-convex for µ ≥ 0 if it satisfies:
$$\langle\nabla h(x), y - x\rangle \le -\big(h(x) - h(y)\big) - \frac{\mu}{2}\|x - y\|^2, \quad \text{for any } x, y$$
When µ > 0, we have strong convexity, a quadratic lower bound on h.

Proposition D.5 (Convexity and smoothness). If the client losses f_i and the centralized loss f_c are each β-smooth (Definition D.1), and x* is an optimum of the overall loss f (as defined in Equation 1), then the following holds true:
$$\frac{1}{2\beta}\left(\frac{1}{N}\sum_{i=1}^N\|\nabla f_i(x) - \nabla f_i(x^*)\|^2 + \|\nabla f_c(x) - \nabla f_c(x^*)\|^2\right) \le f(x) - f(x^*)$$
Proof. Define the functions $\tilde{f}_i(x) := f_i(x) - \langle\nabla f_i(x^*), x\rangle$ for all clients i, and $\tilde{f}_c(x) := f_c(x) - \langle\nabla f_c(x^*), x\rangle$. Since f_i and f_c are convex and β-smooth, so are $\tilde{f}_i$ and $\tilde{f}_c$, and furthermore their gradients vanish at x*; hence x* is a common minimizer of $\tilde{f}_i$, $\tilde{f}_c$, and f. Using the β-smoothness of $\tilde{f}_i$ and $\tilde{f}_c$, we have $\frac{1}{2\beta}\|\nabla\tilde{f}_i(x)\|^2 \le \tilde{f}_i(x) - \tilde{f}_i(x^*)$ and $\frac{1}{2\beta}\|\nabla\tilde{f}_c(x)\|^2 \le \tilde{f}_c(x) - \tilde{f}_c(x^*)$. Note that $\frac{1}{N}\sum_i\tilde{f}_i + \tilde{f}_c = f$ since $\frac{1}{N}\sum_i\nabla f_i(x^*) + \nabla f_c(x^*) = \nabla f(x^*) = 0$. The claimed bound then follows from the above two facts.

Proposition D.6 (Bound on the gradient of the overall loss). Under the conditions of Proposition D.5:
$$\|\nabla f(x)\|^2 \le 4\beta\big(f(x) - f(x^*)\big)$$
This follows from Proposition D.5 by writing $\nabla f(x) = \frac{1}{N}\sum_i\big(\nabla f_i(x) - \nabla f_i(x^*)\big) + \big(\nabla f_c(x) - \nabla f_c(x^*)\big)$ and applying the relaxed triangle inequality (Lemma D.2).

Lemma D.7 (Perturbed strong convexity). For a β-smooth and µ-convex function h, and any x, y, z:
$$\langle\nabla h(x), z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - \beta\|z - x\|^2$$

Lemma D.8 (Contractive mapping). For a β-smooth and µ-convex function h, and step size η ≤ 1/β:
$$\big\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\big\|^2 \le (1 - \mu\eta)\|x - y\|^2$$
Proof. Expanding terms, and applying smoothness (Definition D.1):
$$\big\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\big\|^2 = \|x - y\|^2 - 2\eta\langle\nabla h(x) - \nabla h(y), x - y\rangle + \eta^2\|\nabla h(x) - \nabla h(y)\|^2 \le \|x - y\|^2 - \eta\langle\nabla h(x) - \nabla h(y), x - y\rangle \le (1 - \mu\eta)\|x - y\|^2$$
The first inequality uses $\|\nabla h(x) - \nabla h(y)\|^2 \le \beta\langle\nabla h(x) - \nabla h(y), x - y\rangle$ (co-coercivity of the gradient of a smooth convex function) together with η ≤ 1/β, and the second uses strong convexity, $\langle\nabla h(x) - \nabla h(y), x - y\rangle \ge \mu\|x - y\|^2$.

D.2 PROOFS OF CONVERGENCE RATES

We will now prove the rates of convergence stated in Theorem C.2 for 1-WAY GRADIENT TRANSFER. Subsection D.2.1 proves the convergence rates for the strongly convex and general convex cases, and Subsection D.2.2 proves the convergence rate for the non-convex case. Let S be the cardinality of the cohort of clients $\mathcal{S}$ participating in a round of training. Let the server and client optimizers be SGD. Let the clients all take an equal number of steps K, and let $\tilde\eta$ be the 'effective step size', equal to $K\eta_s\eta$. With 1-WAY GRADIENT TRANSFER, the server update of the global model at round t can be written as:
$$x^{(t+1)} - x^{(t)} = -\frac{\tilde\eta}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K\big(g_i(x_i^{(t,k)}) + g_c(x^{(t)})\big) = -\tilde\eta\,g_c(x^{(t)}) - \frac{\tilde\eta}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K g_i(x_i^{(t,k)})$$
Henceforth, let $\mathbb{E}_{|t}[\cdot]$ denote expectation conditioned on $x^{(t)}$.
As in Karimireddy et al. (2020b), we define a client local 'drift' term in round t as:
$$\mathcal{E}^{(t)} = \frac{1}{KN}\sum_{i=1}^N\sum_{k=1}^K\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2$$
Lemma D.9 (Bound on variance of server update). The variance of the server update is bounded as:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^{(t)}\|^2 \le 4\tilde\eta^2\beta^2\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right) + 2\tilde\eta^2\,\mathbb{E}_{|t}\|\nabla f(x^{(t)})\|^2$$
Proof. Let $\mathcal{S}$ denote the set of clients sampled in round t. For brevity, we write $\Delta x$ for $x^{(t+1)} - x^{(t)}$.
$$\mathbb{E}_{|t}\|\Delta x\|^2 = \mathbb{E}_{|t}\Big\|\frac{\tilde\eta}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K\big(g_i(x_i^{(t,k)}) + g_c(x^{(t)})\big)\Big\|^2 = \mathbb{E}_{|t}\Big\|\frac{\tilde\eta}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K\big(g_i(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big) + \tilde\eta\big(\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big)\Big\|^2$$
We separate terms by applying the relaxed triangle inequality (Lemma D.2):
$$\mathbb{E}_{|t}\|\Delta x\|^2 \le \underbrace{2\tilde\eta^2\,\mathbb{E}_{|t}\Big\|\frac{1}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K\big(g_i(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big)\Big\|^2}_{A} + \underbrace{2\tilde\eta^2\,\mathbb{E}_{|t}\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2}_{B}$$
In term A, we separate the mean and variance of the client stochastic gradients $g_i$, using Lemma D.3 and Equation 4:
$$A \le 4\tilde\eta^2\,\mathbb{E}_{|t}\Big\|\frac{1}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^K\big(\nabla f_f(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big)\Big\|^2 + \frac{4\tilde\eta^2\sigma^2}{KS}$$
We apply the relaxed triangle inequality (Lemma D.2) followed by smoothness (Definition D.1), to convert this to an expression in terms of the drift $\mathcal{E}^{(t)}$:
$$A \le \frac{4\tilde\eta^2}{KN}\sum_{i=1}^N\sum_{k=1}^K\mathbb{E}_{|t}\big\|\nabla f_f(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big\|^2 + \frac{4\tilde\eta^2\sigma^2}{KS} \le \frac{4\tilde\eta^2\beta^2}{KN}\sum_{i=1}^N\sum_{k=1}^K\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 + \frac{4\tilde\eta^2\sigma^2}{KS} = 4\tilde\eta^2\beta^2\mathcal{E}^{(t)} + \frac{4\tilde\eta^2\sigma^2}{KS}$$
In term B we have a full gradient of the federated loss, $\nabla f_f$, and a stochastic gradient of the centralized loss, $g_c$. We use Lemma D.3 to separate the stochastic gradient into a full gradient of the centralized loss $\nabla f_c$ and a variance term, allowing us to express B in terms of the full gradient of the overall loss $\nabla f$.
$$B = 2\tilde\eta^2\,\mathbb{E}_{|t}\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 \le 2\tilde\eta^2\,\mathbb{E}_{|t}\big\|\nabla f_f(x^{(t)}) + \nabla f_c(x^{(t)})\big\|^2 + 2\tilde\eta^2\sigma_c^2 = 2\tilde\eta^2\,\mathbb{E}_{|t}\|\nabla f(x^{(t)})\|^2 + 2\tilde\eta^2\sigma_c^2$$
Combining A and B back together:
$$\mathbb{E}_{|t}\|\Delta x\|^2 \le 4\tilde\eta^2\beta^2\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right) + 2\tilde\eta^2\,\mathbb{E}_{|t}\|\nabla f(x^{(t)})\|^2$$

D.2.1 CONVEX CASES

We will state two lemmas, one (Lemma D.10) related to the progress made in round t towards reaching x*, and the other (Lemma D.11) bounding the federated clients' 'drift' in round t, $\mathcal{E}^{(t)}$. We then combine the two lemmas to give the proofs of convergence rate for the strongly convex (µ > 0) and general convex (µ = 0) cases.

Lemma D.10 (One round progress). Suppose our functions satisfy bounded variance, µ-convexity (Definition D.4), and β-smoothness (Definition D.1). If $\tilde\eta \le \frac{1}{8\beta}$, the updates of 1-WAY GRADIENT TRANSFER satisfy:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^*\|^2 \le \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 - \tilde\eta\big(f(x^{(t)}) - f(x^*)\big) + \frac{5}{16}\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right)$$
Proof. The expected server update, with N total clients in the federated population, is:
$$\mathbb{E}\big[x^{(t+1)} - x^{(t)}\big] = -\tilde\eta\,\mathbb{E}\big[g_c(x^{(t)})\big] - \frac{\tilde\eta}{KN}\sum_{i=1}^N\sum_{k=1}^K\mathbb{E}\big[g_i(x_i^{(t,k)})\big]$$
The distance from the optimum x* in parameter space at round t is $\|x^{(t)} - x^*\|^2$. The expected distance from the optimum at round t + 1, conditioned on $x^{(t)}$ and earlier rounds, is:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^*\|^2 = \|x^{(t)} - x^*\|^2 + \underbrace{2\,\mathbb{E}_{|t}\big\langle x^{(t+1)} - x^{(t)}, x^{(t)} - x^*\big\rangle}_{C} + \underbrace{\mathbb{E}_{|t}\|x^{(t+1)} - x^{(t)}\|^2}_{D}$$
For clarity, we now focus on individual terms, beginning with C:
$$C = \underbrace{2\tilde\eta\,\big\langle\nabla f_c(x^{(t)}), x^* - x^{(t)}\big\rangle}_{C1} + \underbrace{\frac{2\tilde\eta}{KN}\sum_{i=1}^N\sum_{k=1}^K\big\langle\nabla f_f(x_i^{(t,k)}), x^* - x^{(t)}\big\rangle}_{C2}$$
We can use convexity (Definition D.4) to bound C1, with $x = x^{(t)}$ and $y = x^*$:
$$C1 \le -2\tilde\eta\left(f_c(x^{(t)}) - f_c(x^*) + \frac{\mu}{2}\|x^{(t)} - x^*\|^2\right)$$
We apply perturbed convexity (Lemma D.7) to bound C2, with $x = x_i^{(t,k)}$, $y = x^*$, and $z = x^{(t)}$:
$$C2 \le \frac{2\tilde\eta}{KN}\sum_{i=1}^N\sum_{k=1}^K\left(f_f(x^*) - f_f(x^{(t)}) + \beta\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 - \frac{\mu}{4}\|x^{(t)} - x^*\|^2\right) \le -2\tilde\eta\left(f_f(x^{(t)}) - f_f(x^*) + \frac{\mu}{4}\|x^{(t)} - x^*\|^2\right) + 2\beta\tilde\eta\,\mathcal{E}^{(t)}$$
Combining C1 and C2 back together:
$$C \le -2\tilde\eta\left(f(x^{(t)}) - f(x^*) + \frac{3\mu}{4}\|x^{(t)} - x^*\|^2\right) + 2\beta\tilde\eta\,\mathcal{E}^{(t)}$$
Now we turn to term D, which is the variance of the server update (from Lemma D.9):
$$D = \mathbb{E}_{|t}\|x^{(t+1)} - x^{(t)}\|^2 \le 4\tilde\eta^2\beta^2\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right) + 2\tilde\eta^2\,\mathbb{E}_{|t}\|\nabla f(x^{(t)})\|^2$$
We can leverage Proposition D.6 to replace the squared norm of the gradient of the overall loss:
$$D \le 4\tilde\eta^2\beta^2\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right) + 8\tilde\eta^2\beta\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big]$$
Returning to our equation for the expected distance from the optimum x*, and making use of the bounds established for C and D:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^*\|^2 \le \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 + \big(8\tilde\eta^2\beta - 2\tilde\eta\big)\big(f(x^{(t)}) - f(x^*)\big) + 2\tilde\eta\beta\big(1 + 2\tilde\eta\beta\big)\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right)$$
Assuming that $\tilde\eta \le \frac{1}{8\beta}$:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^*\|^2 \le \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 - \tilde\eta\big(f(x^{(t)}) - f(x^*)\big) + \frac{5}{16}\mathcal{E}^{(t)} + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right)$$

Lemma D.11 (Bounded drift). Suppose our functions satisfy bounded variance, µ-convexity (Definition D.4), and β-smoothness (Definition D.1). Then the drift is bounded as:
$$\mathcal{E}^{(t)} \le 12K^2\eta^2\beta\,\mathbb{E}\big[f(x^{(t)}) - f(x^*)\big] + 3K^2\eta^2\left(\frac{1}{K}\sigma^2 + \sigma_c^2\right)$$
Proof. We begin with the summand of the drift term, looking at the drift of a particular client i at local step k.
Expanding this summand out:
$$\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 = \mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - \eta\big(g_i(x_i^{(t,k-1)}) + g_c(x^{(t)})\big) - x^{(t)}\big\|^2 = \mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)} - \eta g_i(x_i^{(t,k-1)}) - \eta g_c(x^{(t)})\big\|^2$$
Separating the mean and variance of the client gradient, then using the relaxed triangle inequality (Lemma D.2) to further separate terms:
$$\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)} - \eta\nabla f_f(x_i^{(t,k-1)}) - \eta g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2$$
$$\le \left(1 + \frac{1}{a}\right)\underbrace{\mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)} - \eta\big(\nabla f_f(x_i^{(t,k-1)}) - \nabla f_f(x^{(t)})\big)\big\|^2}_{F} + (1 + a)\,\eta^2\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2$$
Term F is bounded via the contractive mapping lemma (Lemma D.8), provided that $\eta \le \frac{1}{\beta}$:
$$F \le (1 - \mu\eta)\,\mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 \le \mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2$$
Putting this back into the bound on the drift of client i at local step k, and letting a = K:
$$\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \frac{K+1}{K}\,\mathbb{E}_{|t}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 + 2K\eta^2\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2$$
Unrolling the recursion:
$$\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \Big(2K\eta^2\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2\Big)\sum_{j=0}^{k-1}\left(\frac{K+1}{K}\right)^j \le \Big(2K\eta^2\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2\Big)(2K) \le 4K^2\eta^2\big\|\nabla f_f(x^{(t)}) + g_c(x^{(t)})\big\|^2 + 2K\eta^2\sigma^2$$
The second inequality above uses the following bound:
$$\sum_{j=0}^{k-1}\left(\frac{K+1}{K}\right)^j = K\left(\Big(1 + \frac{1}{K}\Big)^k - 1\right) \le (e - 1)K \le 2K$$
Now separating the mean and variance of the central gradient:
$$\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le 4K^2\eta^2\big\|\nabla f_f(x^{(t)}) + \nabla f_c(x^{(t)})\big\|^2 + 2K^2\eta^2\sigma_c^2 + 2K\eta^2\sigma^2 \le 4K^2\eta^2\|\nabla f(x^{(t)})\|^2 + 2K^2\eta^2\left(\frac{1}{K}\sigma^2 + \sigma_c^2\right)$$
Finally, we apply Proposition D.6:
$$\mathcal{E}^{(t)} \le 4K^2\eta^2\,\mathbb{E}_{|t}\|\nabla f(x^{(t)})\|^2 + 2K^2\eta^2\left(\frac{1}{K}\sigma^2 + \sigma_c^2\right) \le 16K^2\eta^2\beta\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big] + 2K^2\eta^2\left(\frac{1}{K}\sigma^2 + \sigma_c^2\right)$$
Assuming that $\tilde\eta \le \frac{1}{8\beta}$ (and noting $K\eta = \tilde\eta/\eta_s$):
$$\mathcal{E}^{(t)} \le \frac{2\tilde\eta}{\eta_s^2}\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big] + \frac{2\tilde\eta^2}{\eta_s^2}\left(\frac{1}{K}\sigma^2 + \sigma_c^2\right)$$
Proofs of Theorem C.2 for the Convex Cases. Adding the statements of Lemmas D.10 and D.11, and assuming that $\eta_s^2 \ge \frac{5}{8}S$ and $\eta = \frac{1}{8\beta K\eta_s}$ so that $\tilde\eta = \frac{1}{8\beta}$, we get:
$$\mathbb{E}_{|t}\|x^{(t+1)} - x^*\|^2 \le \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 - \tilde\eta\big(f(x^{(t)}) - f(x^*)\big) + \frac{5}{16}\left(\frac{2\tilde\eta}{\eta_s^2}\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big] + \frac{2\tilde\eta^2}{\eta_s^2}\Big(\frac{1}{K}\sigma^2 + \sigma_c^2\Big)\right) + 2\tilde\eta^2\left(\frac{2}{KS}\sigma^2 + \sigma_c^2\right)$$
$$= \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 - \tilde\eta\big(f(x^{(t)}) - f(x^*)\big) + \frac{5\tilde\eta}{8\eta_s^2}\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big] + \left(\frac{5\tilde\eta^2}{8K\eta_s^2} + \frac{4\tilde\eta^2}{KS}\right)\sigma^2 + \left(\frac{5\tilde\eta^2}{8\eta_s^2} + 2\tilde\eta^2\right)\sigma_c^2$$
$$\le \left(1 - \frac{3\mu\tilde\eta}{2}\right)\|x^{(t)} - x^*\|^2 - \frac{S-1}{S}\,\tilde\eta\,\mathbb{E}_{|t}\big[f(x^{(t)}) - f(x^*)\big] + \left(\frac{5\sigma^2}{KS} + 3\sigma_c^2\right)\tilde\eta^2$$
We can now remove the conditioning on $x^{(t)}$ by taking an expectation of both sides over $x^{(t)}$, to get a recurrence relation of the same form. For the case of strong convexity (µ > 0), we can use lemmas (e.g., Lemma 1 in Karimireddy et al. (2020b), Lemma 2 in Stich (2019)) which establish a linear convergence rate for such recursions. This results in the following bound for $T \ge \frac{8\beta}{3\mu}$:
$$\mathbb{E}\big[f(\bar{x}^{(T)})\big] - f(x^*) = \tilde{O}\left(\frac{\sigma^2 + KS\sigma_c^2}{\mu KST} + \mu\|x^{(0)} - x^*\|^2\exp\Big(-\frac{3\mu T}{16\beta}\Big)\right)$$
where $\bar{x}^{(T)}$ is a weighted average of $x^{(1)}, x^{(2)}, \ldots, x^{(T+1)}$ with geometrically decreasing weights $(1 - \frac{3\mu\tilde\eta}{2})^{1-r}$ for $x^{(r)}$, $r = 1, 2, \ldots, T+1$. This yields an expression for the number of rounds T to reach an error ε:
$$T = \tilde{O}\left(\frac{\sigma^2 + KS\sigma_c^2}{KS\mu\epsilon} + \frac{\beta}{\mu}\log\frac{1}{\epsilon}\right)$$
For the case of general convexity (µ = 0), we can use lemmas (e.g., Lemma 2 in Karimireddy et al. (2020b), Lemma 4 in Stich (2019)) which establish a sublinear convergence rate for such recursions. In this case we get the following bound:
$$\mathbb{E}\big[f(\bar{x}^{(T)})\big] - f(x^*) \le \frac{S}{S-1}\left(\frac{8\beta\|x^{(0)} - x^*\|^2}{T+1} + \frac{\sqrt{20\sigma^2 + 12KS\sigma_c^2}\,\|x^{(0)} - x^*\|}{\sqrt{KS(T+1)}}\right)$$
where $\bar{x}^{(T)} = \frac{1}{T+1}\sum_{t=1}^{T+1}x^{(t)}$. This yields an expression for the number of rounds T to reach an error ε:
$$T = O\left(\frac{(\sigma^2 + KS\sigma_c^2)D^2}{KS\epsilon^2} + \frac{\beta D^2}{\epsilon}\right)$$
In the above expression, $D^2 = \|x^{(0)} - x^*\|^2$ is a squared distance in parameter space at initialization.

D.2.2 NON-CONVEX CASE

We will now prove the rate of convergence stated in Theorem C.2 for the non-convex case for 1-WAY GRADIENT TRANSFER. We state two lemmas, one (Lemma D.12) establishing the progress made in each round, and one (Lemma D.13) bounding how much the federated clients 'drift' in a round during the course of local training. We then combine the two lemmas to give the proof of the convergence rate for the non-convex case.

Lemma D.12 (Non-convex one-round progress). The progress made in a round can be bounded as:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{4\tilde{\eta}}{9}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\beta}{27}E^{(t)} + \Big(\frac{2}{KS}\sigma^2 + \sigma_c^2\Big)\beta\tilde{\eta}^2.$$
Proof. We begin by using the smoothness of $f$ to get the following bound on the expectation of $f(x^{(t+1)})$ conditioned on $x^{(t)}$:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le \mathbb{E}_{|t}\Big[f(x^{(t)}) + \big\langle \nabla f(x^{(t)}),\, x^{(t+1)} - x^{(t)}\big\rangle + \tfrac{\beta}{2}\big\|x^{(t+1)} - x^{(t)}\big\|^2\Big] = f(x^{(t)}) + \mathbb{E}_{|t}\big\langle \nabla f(x^{(t)}),\, x^{(t+1)} - x^{(t)}\big\rangle + \tfrac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2.$$
Substituting in the definition of the 1-WAY GRADIENT TRANSFER server update (Equation 11), and using Assumption 5.1 for the expectation of the client stochastic gradients:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) + \mathbb{E}_{|t}\Big\langle \nabla f(x^{(t)}),\, -\tilde{\eta}\Big(\frac{1}{KS}\sum_{i\in\mathcal{S}}\sum_{k=1}^{K} g_i(x_i^{(t,k)}) + g_c(x^{(t)})\Big)\Big\rangle + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2$$
$$\le f(x^{(t)}) - \tilde{\eta}\,\mathbb{E}_{|t}\Big\langle \nabla f(x^{(t)}),\, \frac{1}{KN}\sum_{i=1}^{N}\sum_{k=1}^{K} \nabla f_f(x_i^{(t,k)}) + \nabla f_c(x^{(t)})\Big\rangle + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2.$$
Next, we make use of the fact that $-ab = \frac{1}{2}\big((b-a)^2 - a^2 - b^2\big) \le -\frac{1}{2}a^2 + \frac{1}{2}(b-a)^2$:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}}{2}\,\mathbb{E}_{|t}\Big\|\frac{1}{KN}\sum_{i=1}^{N}\sum_{k=1}^{K}\nabla f_f(x_i^{(t,k)}) + \nabla f_c(x^{(t)}) - \nabla f(x^{(t)})\Big\|^2 + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2$$
$$= f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}}{2}\,\mathbb{E}_{|t}\Big\|\frac{1}{KN}\sum_{i=1}^{N}\sum_{k=1}^{K}\big(\nabla f_f(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big)\Big\|^2 + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2$$
$$\le f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}}{2}\,\frac{1}{KN}\sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{E}_{|t}\big\|\nabla f_f(x_i^{(t,k)}) - \nabla f_f(x^{(t)})\big\|^2 + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2,$$
where the equality uses $\nabla f = \nabla f_f + \nabla f_c$, and the last step uses Jensen's inequality.
Next, we use smoothness (Definition D.1), and the definition of client drift (Equation 12):
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}\beta^2}{2}\,\frac{1}{KN}\sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{E}_{|t}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2 = f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}\beta^2}{2}E^{(t)} + \frac{\beta}{2}\,\mathbb{E}_{|t}\big\|x^{(t+1)} - x^{(t)}\big\|^2.$$
The last term is the variance of the server update, for which we can substitute the bound from Lemma D.9:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{\tilde{\eta}}{2}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}\beta^2}{2}E^{(t)} + \frac{\beta}{2}\Big[4\tilde{\eta}^2\beta^2 E^{(t)} + 2\tilde{\eta}^2\Big(\frac{2}{KS}\sigma^2 + \sigma_c^2\Big) + 2\tilde{\eta}^2\,\mathbb{E}_{|t}\big\|\nabla f(x^{(t)})\big\|^2\Big]$$
$$\le f(x^{(t)}) - \Big(\frac{\tilde{\eta}}{2} - \beta\tilde{\eta}^2\Big)\big\|\nabla f(x^{(t)})\big\|^2 + \Big(\frac{\tilde{\eta}\beta^2}{2} + 2\tilde{\eta}^2\beta^3\Big)E^{(t)} + \tilde{\eta}^2\beta\Big(\frac{2}{KS}\sigma^2 + \sigma_c^2\Big).$$
Assuming a bound on the effective step-size, $\tilde{\eta} \le \frac{1}{18\beta}$:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{4\tilde{\eta}}{9}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\beta}{27}E^{(t)} + \Big(\frac{2}{KS}\sigma^2 + \sigma_c^2\Big)\beta\tilde{\eta}^2.$$

Lemma D.13 (Non-convex bounded drift). Suppose our functions satisfy bounded variance (Assumption 5.1) and $\beta$-smoothness (Definition D.1). Then the drift is bounded as:
$$E^{(t)} \le \frac{4\tilde{\eta}}{9\beta\eta_s^2}\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{2\tilde{\eta}^2}{\eta_s^2}\Big(\frac{1}{K}\sigma^2 + 4\sigma_c^2\Big).$$
Proof. We begin with the summand of the drift term, looking at the drift of a particular client $i$ at local step $k$. Expanding this summand out:
$$\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 = \mathbb{E}\big\|x_i^{(t,k-1)} - \eta\big(g_i(x_i^{(t,k-1)}) + g_c(x^{(t)})\big) - x^{(t)}\big\|^2 = \mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)} - \eta g_i(x_i^{(t,k-1)}) - \eta g_c(x^{(t)})\big\|^2.$$
Separating the mean and variance of the client gradient:
$$\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)} - \eta\nabla f_f(x_i^{(t,k-1)}) - \eta g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2.$$
Next we use the relaxed triangle inequality (Lemma D.2) to further separate terms:
$$\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \Big(1 + \frac{1}{a}\Big)\mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 + (1+a)\,\eta^2\,\mathbb{E}\big\|\nabla f_f(x_i^{(t,k-1)}) + g_c(x^{(t)})\big\|^2 + \eta^2\sigma^2$$
$$= \Big(1 + \frac{1}{a}\Big)\mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 + (1+a)\,\eta^2\,\mathbb{E}\big\|\nabla f_f(x_i^{(t,k-1)}) - \nabla f_f(x^{(t)}) + g_c(x^{(t)}) - \nabla f_c(x^{(t)}) + \nabla f(x^{(t)})\big\|^2 + \eta^2\sigma^2$$
$$\le \Big(1 + \frac{1}{a}\Big)\mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 + (1+a)\,2\eta^2\,\underbrace{\mathbb{E}\big\|\nabla f_f(x_i^{(t,k-1)}) - \nabla f_f(x^{(t)})\big\|^2}_{H} + (1+a)\,4\eta^2\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + (1+a)\,4\eta^2\,\underbrace{\mathbb{E}\big\|g_c(x^{(t)}) - \nabla f_c(x^{(t)})\big\|^2}_{J} + \eta^2\sigma^2.$$
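The three-way split in the last step, $\|a+b+c\|^2 \le 2\|a\|^2 + 4\|b\|^2 + 4\|c\|^2$ (two applications of the relaxed triangle inequality), can be spot-checked on random vectors; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    a, b, c = rng.normal(size=(3, 5))
    lhs = np.linalg.norm(a + b + c) ** 2
    # ||a+b+c||^2 <= 2||a||^2 + 2||b+c||^2 <= 2||a||^2 + 4||b||^2 + 4||c||^2
    rhs = 2 * np.linalg.norm(a) ** 2 + 4 * np.linalg.norm(b) ** 2 + 4 * np.linalg.norm(c) ** 2
    assert lhs <= rhs + 1e-9
```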
In the above inequality, term $H$ can be converted via smoothness (Definition D.1), and term $J$ is the variance of the centralized stochastic gradient (Equation 6). Letting $a = K$, we have:
$$\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \Big(\frac{K+1}{K} + 2K\eta^2\beta^2\Big)\mathbb{E}\big\|x_i^{(t,k-1)} - x^{(t)}\big\|^2 + 4K\eta^2\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + 4K\eta^2\sigma_c^2 + \eta^2\sigma^2.$$
Unrolling the above recurrence, we get:
$$\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2 \le \Big(4K\eta^2\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + 4K\eta^2\sigma_c^2 + \eta^2\sigma^2\Big)\sum_{j=0}^{k-1}\Big(\frac{K+1}{K} + 2K\eta^2\beta^2\Big)^j = \Big(\frac{4\tilde{\eta}^2}{K\eta_s^2}\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}^2}{K\eta_s^2}\Big(\frac{1}{K}\sigma^2 + 4\sigma_c^2\Big)\Big)\sum_{j=0}^{k-1}\Big(\frac{K+1}{K} + 2K\eta^2\beta^2\Big)^j \le \Big(\frac{4\tilde{\eta}^2}{K\eta_s^2}\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{\tilde{\eta}^2}{K\eta_s^2}\Big(\frac{1}{K}\sigma^2 + 4\sigma_c^2\Big)\Big)(2K),$$
where the last inequality uses $\sum_{j=0}^{k-1}\big(\frac{K+1}{K} + 2K\eta^2\beta^2\big)^j \le 2K$, which holds when $\eta_s \ge 1$ and $\tilde{\eta} \le \frac{1}{18\beta}$.

Multiplying through by $2K$, using $8\tilde{\eta}^2 \le \frac{4\tilde{\eta}}{9\beta}$ (from $\tilde{\eta} \le \frac{1}{18\beta}$), and adding back the summations over $i$ and $k$, the bound on client drift is:
$$E^{(t)} \le \frac{4\tilde{\eta}}{9\beta\eta_s^2}\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{2\tilde{\eta}^2}{\eta_s^2}\Big(\frac{1}{K}\sigma^2 + 4\sigma_c^2\Big).$$

Proof of Theorem C.2 for Non-Convex Case. Adding the statements of Lemmas D.12 and D.13, and assuming $\eta_s \ge \sqrt{S}$, we get:
$$\mathbb{E}_{|t}\big[f(x^{(t+1)})\big] \le f(x^{(t)}) - \frac{4\tilde{\eta}}{9}\big\|\nabla f(x^{(t)})\big\|^2 + \Big(\frac{2}{KS}\sigma^2 + \sigma_c^2\Big)\beta\tilde{\eta}^2 + \frac{\beta}{27}\Big[\frac{4\tilde{\eta}}{9\beta\eta_s^2}\,\mathbb{E}\big\|\nabla f(x^{(t)})\big\|^2 + \frac{2\tilde{\eta}^2}{\eta_s^2}\Big(\frac{1}{K}\sigma^2 + 4\sigma_c^2\Big)\Big]$$
$$\le f(x^{(t)}) - \frac{\tilde{\eta}}{3}\big\|\nabla f(x^{(t)})\big\|^2 + \Big(\frac{3}{KS}\sigma^2 + 2\sigma_c^2\Big)\beta\tilde{\eta}^2.$$
With the above, we have a recursive bound on the loss after round $t+1$. We can use lemmas (e.g., Lemma 2 in Karimireddy et al. (2020b), Lemma 4 in Stich (2019)) which establish a sublinear convergence rate for such recursions. This yields an expression for the number of rounds $T$ to reach an error $\epsilon$:
$$T = O\Big(\frac{(\sigma^2 + KS\sigma_c^2)\,\beta F}{KS\,\epsilon^2} + \frac{\beta F}{\epsilon}\Big),$$
where $F = f(x^{(0)}) - f(x^*)$ is the error at initialization.
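The constant absorptions in the final inequality (e.g., $-\frac{4}{9} + \frac{4}{243} \le -\frac{1}{3}$) can be verified with exact rational arithmetic; a quick sketch, taking the worst case $\eta_s^2 = S$ so that each $\frac{1}{\eta_s^2}$ factor contributes $\frac{1}{S}$ (factored out below):

```python
from fractions import Fraction as F

# Coefficient of eta_tilde * ||grad f||^2:  -4/9 + (1/27)(4/9) <= -1/3.
grad = F(-4, 9) + F(1, 27) * F(4, 9)
assert grad <= F(-1, 3)

# Coefficient of beta * eta_tilde^2 * sigma^2 / (KS):  2 + (1/27)*2 <= 3.
sigma = F(2) + F(1, 27) * 2
assert sigma <= 3

# Coefficient of beta * eta_tilde^2 * sigma_c^2:  1 + (1/27)*8 <= 2.
sigma_c = F(1) + F(1, 27) * 8
assert sigma_c <= 2
```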



We use 'joint' to distinguish our work from sequential 'central-then-FL' use cases, e.g. transfer learning.
To simplify, we subsume any relative weights into the loss terms, i.e. this can be $f(x) = w_f f_f(x) + w_c f_c(x)$.
Were they not to differ, one could treat a centralized compute node as an additional client in standard FL, and simply make use of an established FL algorithm like FEDAVG for training $x$.
Even stronger privacy properties are possible when FL is combined with technologies such as differential privacy (DP) and secure multiparty computation (SMPC) (Wang et al., 2021).
An interesting similarity between PDA-DPMD (Amid et al., 2021) and our work: in PDA-DPMD for FL, a first-order approximation of mirror descent is used, where the server model update is calculated as a weighted sum of private (federated) and public loss terms, just as in PARALLEL TRAINING or 2-WAY GRADIENT TRANSFER.
CelebA federated data is available via open-source FL software (TFF CelebA documentation, 2022).
Stack Overflow federated data is available via open source (TFF StackOverflow documentation, 2022).
Wikipedia data is available via open source (TFDS Wikipedia (20201201.en) documentation, 2022).
E.g., for a next-URL prediction task with millions of URLs, the embedding table size can reach gigabytes.
By its definition in Equation 2, combined with the triangle inequality.
By its definition in Equation 2, it is a convex combination of the $f_i$.
Note: $\nabla f_f(x)$ is useful for theoretical convergence analysis, but cannot be practically computed in a real cross-device FL setting. In contrast, $\nabla f_c(x)$ can be computed.
An open-source mixed FL code repository will be shared with the public version of the paper.
Adapted from an online tutorial involving CelebA binary attribute classification: "TensorFlow Constrained Optimization Example Using CelebA Dataset".
Adapted from an online tutorial involving next character prediction: "Text generation with an RNN".
The Õ notation hides dependence on logarithmic terms which can be removed by using varying step-sizes.



(a) Smile Classifier: Eval. AUC (ROC) vs. Round. (b) Language Model: Evaluation Accuracy vs. Round.

Figure 1: Mixed FL resolves distribution shift, enabling accuracy equal to that achieved if data were colocated and centrally trained ('oracle'). The smile classifier reaches an oracle's evaluation AUC (ROC) of over 0.95. The language model reaches an oracle's evaluation accuracy of over 0.66. Evaluation is over all data (i.e., smiling and non-smiling faces; Stack Overflow and Wikipedia). Plots show 95% confidence intervals.

Figure 2: Comparative convergence differs by task; (a) shows PT converging more slowly than 1-W GT or 2-W GT, while (b) shows all algorithms converging at the same rate. See Section 5.3 for a theoretical explanation.

Figure 3: Movie prediction, Eval. Loss vs. Round. We see that mixed FL results in loss similar to the more expensive baseline scenario (see Table 2).

For simplicity, this paper described mixed FL with FEDAVG used for federated training. However, one could use DP-FedAvg (McMahan et al., 2018) or DP-FTRL (Kairouz et al., 2021b) with mixed FL to ensure that DP protections are applied to the federated data. Similarly, an adaptive SERVEROPTIMIZER like FEDADAM (Reddi et al., 2020) could be used in the federated portion of training (preliminary mixed FL results with FEDADAM are given in Appendix B.4.4).

(a) $G_t^2$ vs. Round $t$. (b) $B_t^2$ vs. Round $t$.

Figure 4: Sampled approximations of mixed FL (G, B)-BGD, for the three experiments in Section 4.

(a) Evaluation Accuracy on Stack Overflow vs. Round. (b) Evaluation Accuracy on Wikipedia vs. Round.

Figure 5: Language model accuracy, evaluated on only decentralized (left) or centralized (right) data.

Figure 6: Next movie prediction performance (recall@10).

(a) Eval. Loss vs. Round. (b) Eval. AUC (ROC) vs. Round.

Figure 7: Smile classifier training, 1-W GT with various central batch sizes $|B_c|$.

(a) Eval. Loss vs. Round. (b) Eval. AUC (ROC) vs. Round.

Figure 8: Smile classifier training, 2-W GT with various central batch sizes $|B_c|$.

(a) Eval. Loss vs. Round (PT). (b) Eval. AUC (ROC) vs. Round (PT). (c) Eval. Loss vs. Round (1-W GT). (d) Eval. AUC (ROC) vs. Round (1-W GT). (e) Eval. Loss vs. Round (2-W GT). (f) Eval. AUC (ROC) vs. Round (2-W GT).

Figure 9: Smile classifier training with different η and K settings (for each mixed FL algorithm).

(a) Eval. Loss vs. Round (PT). (b) Eval. Accuracy vs. Round (PT). (c) Eval. Loss vs. Round (1-W GT). (d) Eval. Accuracy vs. Round (1-W GT). (e) Eval. Loss vs. Round (2-W GT). (f) Eval. Accuracy vs. Round (2-W GT).

Figure 10: Language model training with different η and K settings (for each mixed FL algorithm).

(a) Evaluation Loss vs. Round. (b) Evaluation Accuracy vs. Round.

Figure 11: Language model training, with higher η for PT and 1-W GT (K = 16 for all). The increased learning rate boosts progress on evaluation loss and accuracy early in optimization, but does not change the number of rounds ultimately required for convergence. All three algorithms have reached similar loss and accuracy by round 3000, and are still converging.

Figure 12: Next movie prediction performance when training with adaptive optimizer (client optimizer: SGD, server optimizer: ADAM).

(a) Eval. Accuracy on All Data vs. Round. (b) Eval. Accuracy on Stack Overflow vs. Round. (c) Eval. Accuracy on Wikipedia vs. Round.

Figure 13: Mixed FL endeavors to learn both data distributions, but transfer learning fails to retain knowledge of the pretraining task. When the pretraining-to-fine-tuning switch happens at round 250, the transfer learning cases fit to the new distribution and 'forget' the previous distribution they were pretrained on. PARALLEL TRAINING does not suffer from this problem as it sees data from both distributions throughout training.

$$\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\|^2 = \|x - y\|^2 + \eta^2\|\nabla h(x) - \nabla h(y)\|^2 - 2\eta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle \le \|x - y\|^2 + \big(\eta^2\beta - 2\eta\big)\langle\nabla h(x) - \nabla h(y),\, x - y\rangle,$$
where the inequality uses smoothness in the form $\|\nabla h(x) - \nabla h(y)\|^2 \le \beta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle$. If the step-size is such that $\eta \le \frac{1}{\beta}$, then:
$$\big(\eta^2\beta - 2\eta\big)\langle\nabla h(x) - \nabla h(y),\, x - y\rangle \le -\eta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle.$$
Finally, by $\mu$-strong convexity (Definition D.4) of $h$ we have:
$$-\eta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle \le -\mu\eta\|x - y\|^2.$$

D.2 PROOFS OF THEOREM C.2


Experiments summary. For training hyperparameters and further details, see Appendix B.

Movie recommendation: computation (COMP.) and communication (COMM.) overhead per client. The baseline scenario computes everything on clients. See Appendix B.3 for analysis.

Karimireddy et al. (2020b)). Let $g_f$ denote an unbiased stochastic gradient of the federated loss $f_f$, formed by randomly sampling a cohort of $S$ (out of $N$ total) clients, randomly sampling a batch $B_i$ of client data examples on each, and averaging the respective client stochastic gradients. Given Assumption 5.1, we bound the variance of $g_f$:

Sampled approximations of (G, B)-BGD for the three tasks: SMILE CLASSIFICATION, LANGUAGE MODELING, MOVIE RECOMMENDATION. (See Appendix B.1 for details.)

Table 4 states the maximums of $G_t^2$ and $B_t^2$ for the experiments of Section 4. This provides some theoretical explanation for empirical observations made during the experiments. • Smile Classification (4.1): As discussed, PARALLEL TRAINING is at a disadvantage when $G \gg 0$ or $B \gg 1$, and Table 4 (left column) shows that $G_t$ is significantly large in smile classification. This aligns with our empirical observation in Figure 2a of PARALLEL TRAINING converging more slowly than either GRADIENT TRANSFER algorithm.



Smile classifier training, federated hyperparameters.

Movie recommender training, centralized and overall hyperparameters.

Datasets. The Stack Overflow dataset is a large-scale federated dataset, consisting of 342,477 training clients and 204,088 evaluation clients (TFF StackOverflow documentation, 2022). The training clients have an average cache size of ~400 examples, and the evaluation clients have an average cache size of ~80 examples. The Wikipedia dataset (wikipedia/20201201.en) consists of 6,210,110 examples (TFDS Wikipedia (20201201.en) documentation, 2022), which we partition by designating the first 5,000,000 examples for training and the remainder for evaluation. All raw text data is processed into sequences of 100 characters. Our evaluation data is a combined dataset consisting of randomly shuffled examples drawn from the Stack Overflow evaluation clients and the Wikipedia dataset. After accounting for raw examples discarded during processing, the evaluation dataset is comprised of approximately 75% Stack Overflow and 25% Wikipedia. Hence, we used this weighting proportion when performing mixed FL training in the language modeling problem (Table 8).

Andrew et al. (2021), without noise addition) and a fixed clip value of 0.2 for the centralized gradients. SGD was used for all optimizers: CLIENTOPTIMIZER and SERVEROPTIMIZER, and (if PT or 2-W GT) CENTRALOPTIMIZER and MERGEOPTIMIZER.

Computation and communication overheads for Movie Recommendation task.

Assuming $\eta_s \ge 1$ and $\tilde{\eta} \le \frac{1}{18\beta}$, we have $\frac{K+1}{K} + 2K\eta^2\beta^2 \le 1 + \frac{1}{K} + \frac{1}{162K}$, so $\sum_{j=0}^{k-1}\big(\frac{K+1}{K} + 2K\eta^2\beta^2\big)^j \le 2K$ for all $k \le K$.
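This perturbed geometric-sum bound can be checked numerically over a grid of $K$ and $\eta_s$, taking the largest allowed step $\eta = \frac{1}{18\beta K\eta_s}$ (so that $\tilde{\eta} = \frac{1}{18\beta}$); a quick sketch:

```python
beta = 1.0
for K in [1, 2, 4, 16, 64, 256]:
    for eta_s in [1.0, 2.0, 10.0]:
        eta = 1.0 / (18 * beta * K * eta_s)        # largest allowed step-size
        ratio = (K + 1) / K + 2 * K * eta ** 2 * beta ** 2
        total = sum(ratio ** j for j in range(K))  # worst case k = K
        assert total <= 2 * K
```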

C.2 1-WAY GRADIENT TRANSFER

We now provide convergence bounds for the 1-WAY GRADIENT TRANSFER scenario. Unlike PARALLEL TRAINING, which could be thought of as a 'meta' version of an existing FL algorithm (FEDAVG), 1-WAY GRADIENT TRANSFER is an entirely new FL algorithm. As such, we must formulate a novel proof (Appendix D) of its convergence bounds. Given Assumption 5.1, the following theorem gives the number of rounds to reach a given expected error.

Theorem C.2. For 1-WAY GRADIENT TRANSFER, where the federated data is IID (Assumption 5.1), for $\beta$-smooth functions $f_i$ and $f_c$, the number of rounds $T$ to reach an expected error smaller than $\epsilon$ is:
$$T = \tilde{O}\Big(\frac{\sigma^2 + KS\sigma_c^2}{KS\mu\epsilon} + \frac{\beta}{\mu}\log\frac{1}{\epsilon}\Big) \;(\mu\text{-strongly convex}), \quad T = O\Big(\frac{(\sigma^2 + KS\sigma_c^2)D^2}{KS\epsilon^2} + \frac{\beta D^2}{\epsilon}\Big) \;(\text{general convex}), \quad T = O\Big(\frac{(\sigma^2 + KS\sigma_c^2)\beta F}{KS\epsilon^2} + \frac{\beta F}{\epsilon}\Big) \;(\text{non-convex}),$$
where $D^2 = \|x^{(0)} - x^*\|^2$ and $F = f(x^{(0)}) - f(x^*)$.

Proof. Detailed proof given in Appendix D.

C.3 2-WAY GRADIENT TRANSFER

Given Assumption 5.1, one can view 2-WAY GRADIENT TRANSFER as a 'meta-SCAFFOLD' involving two 'meta-clients' (analogous to the view of PARALLEL TRAINING as 'meta-FEDAVG' in Subsection C.1). As such, we can take the convergence theorem for SCAFFOLD derived in Karimireddy et al. (2020b) (Section 5, Theorem III) and observe that it applies to the number of rounds $T$ to reach convergence in the 2-WAY GRADIENT TRANSFER scenario.

Theorem C.3. For 2-WAY GRADIENT TRANSFER, where the federated data is IID (Assumption 5.1), for $\beta$-smooth functions $f_f$ and $f_c$, the number of rounds $T$ to reach an expected error smaller than $\epsilon$ is given by the SCAFFOLD rate of Karimireddy et al. (2020b), with meta-client stochastic-gradient variances $\sigma^2/S$ and $\sigma_c^2$.

Proof. The analysis is exactly along the lines of the analysis in Karimireddy et al. (2020b), Appendix E, in the context of SCAFFOLD. Effectively, the analysis applies to the 'meta-SCAFFOLD' problem of 2-WAY GRADIENT TRANSFER, with two 'meta-clients': one being the central loss/data (with stochastic gradients of variance $\sigma_c^2$), and the other being the federated loss/data. The homogeneity of the clients and the averaging over the sampled clients effectively reduces the variance of the stochastic gradients to $\sigma^2/S$. The analysis follows in a straightforward manner by accounting for the variance in the appropriate places. We omit the details for brevity.

Proposition D.6 (Convex bound on gradient of overall loss). If the client losses $f_i$ and centralized loss $f_c$ are each $\mu$-convex (Definition D.4) and $\beta$-smooth (Definition D.1), and $x^*$ is an optimum of the overall loss $f$ (as defined in Equation 1), then the expected norm of the gradient of the overall loss is bounded as:
$$\mathbb{E}\big\|\nabla f(x)\big\|^2 \le 4\beta\big(f(x) - f(x^*)\big).$$
Proof. Apply the relaxed triangle inequality (Lemma D.2) twice, then apply Proposition D.5.

Lemma D.7 (Perturbed strong convexity). The following holds for any $\beta$-smooth and $\mu$-strongly convex function $h$, and any $x, y, z$ in the domain of $h$:
$$\langle \nabla h(x),\, z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - \beta\|z - x\|^2.$$
Proof.
Given any $x$, $y$, and $z$, we get the following two inequalities using smoothness (Definition D.1) and strong convexity (Definition D.4) of $h$:
$$\langle \nabla h(x),\, z - x\rangle \ge h(z) - h(x) - \frac{\beta}{2}\|z - x\|^2, \qquad \langle \nabla h(x),\, x - y\rangle \ge h(x) - h(y) + \frac{\mu}{2}\|x - y\|^2.$$
Further, applying the relaxed triangle inequality (Lemma D.2) gives:
$$\|x - y\|^2 \ge \frac{1}{2}\|z - y\|^2 - \|z - x\|^2.$$
Combining all the inequalities together, we have:
$$\langle \nabla h(x),\, z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|z - y\|^2 - \Big(\frac{\beta}{2} + \frac{\mu}{2}\Big)\|z - x\|^2.$$
The lemma follows since $\beta \ge \mu$.

Lemma D.8 (Contractive mapping). For any $\beta$-smooth and $\mu$-strongly convex function $h$, values $x$ and $y$ in the domain of $h$, and step-size (learning rate) $\eta \le \frac{1}{\beta}$, the following holds true:
$$\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\|^2 \le (1 - \mu\eta)\|x - y\|^2.$$
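Lemma D.8's contraction can be spot-checked numerically on a random quadratic; a minimal sketch (the quadratic and the constants $\mu$, $\beta$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
# h(x) = 0.5 x^T A x with eigenvalues of A in [mu, beta], so grad h(x) = A x.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(rng.uniform(0.5, 2.0, size=d)) @ Q.T
mu, beta = 0.5, 2.0
eta = 1.0 / beta                 # step-size satisfying eta <= 1/beta
for _ in range(100):
    x, y = rng.normal(size=(2, d))
    lhs = np.linalg.norm((x - eta * A @ x) - (y - eta * A @ y)) ** 2
    rhs = (1 - mu * eta) * np.linalg.norm(x - y) ** 2
    assert lhs <= rhs + 1e-9     # gradient-descent map is a contraction
```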

