EFFECTIVE DISTRIBUTED LEARNING WITH RANDOM FEATURES: IMPROVED BOUNDS AND ALGORITHMS

Abstract

In this paper, we study the statistical properties of distributed kernel ridge regression together with random features (DKRR-RF), and obtain optimal generalization bounds under the basic setting, which substantially relax the restriction on the number of local machines in the existing state-of-the-art bounds. Specifically, we first show that the simple combination of the divide-and-conquer technique and random features can achieve the same statistical accuracy as exact KRR in expectation, requiring only O(|D|) memory and O(|D|^{1.5}) time. Then, beyond the generalization bounds in expectation, which capture the average performance over multiple trials, we derive generalization bounds in probability to capture the learning performance of a single trial. Finally, we propose an effective communication strategy to further improve the performance of DKRR-RF, and validate the theoretical bounds via numerical experiments.

1. INTRODUCTION

Kernel ridge regression (KRR) is one of the most popular nonparametric learning methods (Vapnik, 2000). Despite its excellent theoretical guarantees, KRR does not scale well to large scale settings because of its high time and memory complexities (Liu et al., 2013; 2014; 2017; 2018; 2020b; Liu & Liao, 2015; Li et al., 2018; 2019c). Distributed learning (Zhang et al., 2013; Hsieh et al., 2014; Chang et al., 2017b; Li et al., 2019b; Lin et al., 2020), random features (Rahimi & Recht, 2007; Sutherland & Schneider, 2015; Rudi & Rosasco, 2017; Rudi et al., 2018; Liu et al., 2020a; Avron et al., 2017a; Yu et al., 2016; Jacot et al., 2020), and Nyström methods (Drineas & Mahoney, 2005; Ding & Liao, 2012; Yang et al., 2012; Camoriano et al., 2016; Si et al., 2016; Musco & Musco, 2017; Kriukova et al., 2017) are the most widely used large scale techniques to address these scalability issues. Recent statistical learning work on KRR combined with large scale approaches demonstrates that these approaches not only yield great computational gains but also preserve optimal theoretical properties, for example KRR with divide-and-conquer (Zhang et al., 2013; 2015; Chang et al., 2017b; a; Guo et al., 2017; Lin et al., 2017; Li et al., 2019b; d; Lin et al., 2020), with random features (Rudi & Rosasco, 2017; Li et al., 2019e; Carratino et al., 2018; Yang et al., 2012), and with Nyström methods (Bach, 2013; Alaoui & Mahoney, 2015; Rudi et al., 2015; 2017; Ding et al., 2020). Combining distributed learning with other large scale approaches is an intuitive yet effective strategy to further improve effectiveness, for example distributed learning with gradient descent algorithms (Lin & Zhou, 2018; Richards et al., 2020), with multi-pass SGD (Lin & Rosasco, 2017; Lin & Cevher, 2018; 2020), with random features (Li et al., 2019b), and with Nyström methods (Yin et al., 2020).
The optimal generalization performance of these combined approaches has been studied; however, the main theoretical problem is that there is a strict restriction on the number of local machines. For example, in (Lin & Zhou, 2018; Li et al., 2019b; Yin et al., 2020), to guarantee the optimal generalization performance in the basic setting, the number of local machines is restricted to a constant, which is difficult to satisfy in real applications. In this paper, we aim at enlarging the number of local machines by considering communications among the local machines. This paper makes the following three main contributions. Firstly, we improve the existing state-of-the-art results for the divide-and-conquer technique together with random features. We prove that the optimal generalization performance can be guaranteed even when the number of partitions reaches Ω(√|D|), whereas the existing bounds in the basic setting limit it to a constant Ω(1); here |D| is the size of the data set. Secondly, to essentially reflect the generalization performance, beyond the minimax optimal rates in expectation, we derive optimal learning rates in probability, which capture the learning performance of a single trial. Finally, we develop a communication strategy to further improve the performance of our proposed method, and validate the effectiveness of the proposed communications via both theoretical assessments and numerical experiments.

Related Work

The most closely related work concerns the statistical analysis of distributed learning and of random features.

Distributed learning. Optimal learning rates in expectation for divide-and-conquer KRR were established in the seminal works (Zhang et al., 2013; 2015). An improved bound was derived in (Lin et al., 2017) based on a novel integral operator tool. Building on the proof techniques of (Zhang et al., 2013; 2015; Lin et al., 2017), optimal learning rates were established for distributed spectral algorithms (Guo et al., 2017; Blanchard & Mücke, 2018; Lin & Cevher, 2020), distributed gradient descent algorithms (Lin & Zhou, 2018; Richards et al., 2020), distributed semi-supervised learning (Chang et al., 2017b), distributed local average regression (Chang et al., 2017a; Lin & Cevher, 2020), localized SVM (Meister & Steinwart, 2016), etc. Some other communication strategies for distributed learning have been proposed; see e.g. (Fan et al., 2019; Li et al., 2019a; Lin & Cevher, 2020; Li et al., 2020) and references therein. The theoretical analyses mentioned above show that divide-and-conquer learning can achieve the same statistical accuracy as exact KRR; however, there is a strict restriction on the number of local machines. Optimal learning rates with a less strict condition on the number of local machines for distributed stochastic gradient methods and spectral algorithms were established in (Lin & Cevher, 2020). In (Lin et al., 2020), the authors considered communications among the local machines to enlarge their number. However, the communication strategy of (Lin et al., 2020) is based on an operator representation, which requires communicating the input data among the local machines. Thus, it is difficult to protect the data privacy of each local machine.
Furthermore, for each iteration, the communication complexity of each local machine is O(|D|d), where d denotes the input dimension, which is infeasible in practice for large scale data sets.

Random features. The generalization bound of random features was first proposed in (Rahimi & Recht, 2008), which shows that O(|D|) random features are needed to achieve an O(1/√|D|) learning rate. Some works further studied its theoretical performance (Cortes et al., 2010; Yang et al., 2012).

Kernel Ridge Regression (KRR)

KRR is one of the most popular nonparametric learning methods (Shawe-Taylor & Cristianini, 2000; Vapnik, 2000), which can be stated as

f_{D,λ} = arg min_{f ∈ H_K} { (1/|D|) Σ_{i=1}^{|D|} (f(x_i) − y_i)² + λ‖f‖²_K },

where λ > 0 is the regularization parameter and |D| is the size of D. Using the representer theorem (Shawe-Taylor & Cristianini, 2000; Vapnik, 2000), f_{D,λ} can be written as f_{D,λ}(x) = Σ_{i=1}^{|D|} α_i K(x_i, x) with α = (K_D + λI)^{−1} ȳ_D, where K_D = (1/|D|)[K(x_i, x_j)]_{i,j=1}^{|D|} is the |D| × |D| kernel matrix and ȳ_D = (1/|D|)(y_1, …, y_{|D|})^T. Despite the excellent theoretical guarantees (Blanchard & Krämer, 2010; Caponnetto & Vito, 2007), KRR requires O(|D|²) memory to store K_D and O(|D|³) time to invert K_D + λI, which is infeasible in large scale settings.
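As a concrete illustration of the estimator above, the following is a minimal NumPy sketch (not the authors' implementation; the Gaussian kernel, the data, and the hyperparameters are placeholders chosen for illustration):

```python
import numpy as np

def gauss_kernel(X1, X2, sigma=1.0):
    # Pairwise squared distances, then a Gaussian kernel matrix.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def krr_fit(X, y, lam):
    # alpha = (K_D + lam I)^{-1} y_bar with the paper's normalization
    # K_D = (1/|D|)[K(x_i, x_j)] and y_bar = (1/|D|) y.
    n = len(X)
    K = gauss_kernel(X, X) / n
    # O(n^2) memory for K, O(n^3) time for the solve.
    return np.linalg.solve(K + lam * np.eye(n), y / n)

def krr_predict(alpha, X_train, X_test):
    # f_{D,lam}(x) = sum_i alpha_i K(x_i, x)
    return gauss_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = krr_fit(X, y, lam=1e-3)
pred = krr_predict(alpha, X, X)
```

Note that with this normalization, solving (K_D + λI)α = ȳ_D coincides with the standard ridge solution (K + nλI)^{−1}y, which is why the un-normalized kernel can be used at prediction time.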

KRR with Random Features (KRR-RF)

Assuming the spectral measure has a density p(·), the corresponding shift-invariant kernel can be written as

K(x, x′) = ∫_Ω ψ(x, ω) ψ(x′, ω) p(ω) dω,

where ψ : X × Ω → R is a continuous and bounded function with respect to ω and x. The main idea behind random Fourier features is to approximate the kernel function K(x, x′) by its Monte-Carlo estimate (Rahimi & Recht, 2007):

K_M(x, x′) = (1/M) Σ_{i=1}^M ψ(x, ω_i) ψ(x′, ω_i) = ⟨φ_M(x), φ_M(x′)⟩,

where φ_M(x) = (1/√M)(ψ(x, ω_1), …, ψ(x, ω_M))^T. The solution of KRR with random features can be written as f_{M,D,λ}(x) = w_{M,D,λ}^T φ_M(x) with

w_{M,D,λ} = (Φ_{M,D} Φ_{M,D}^T + λI)^{−1} Φ_{M,D} ȳ_D, where Φ_{M,D} = (1/√|D|)(φ_M(x_1), …, φ_M(x_{|D|})).

Distributed KRR with Random Features (DKRR-RF)

Let {D_j}_{j=1}^m be m disjoint subsets with D = ∪_{j=1}^m D_j. Distributed KRR with random features (DKRR-RF) is defined as

f̄^0_{M,D,λ} = Σ_{j=1}^m (|D_j|/|D|) f_{M,D_j,λ},

where f_{M,D_j,λ}(x) = w_{M,D_j,λ}^T φ_M(x) with w_{M,D_j,λ} = (Φ_{M,D_j} Φ_{M,D_j}^T + λI)^{−1} Φ_{M,D_j} ȳ_{D_j}.
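To make the construction concrete, here is a minimal sketch (our illustration, not the authors' code) of random Fourier features for a Gaussian kernel together with the divide-and-conquer estimator f̄^0_{M,D,λ}; the data, bandwidth, and λ are illustrative placeholders:

```python
import numpy as np

def make_rff(d, M, sigma=1.0, seed=0):
    # Random Fourier features for the Gaussian kernel
    # exp(-||x - x'||^2 / (2 sigma^2)): psi(x, omega) = sqrt(2) cos(w^T x + b).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, M)) / sigma
    b = rng.uniform(0, 2 * np.pi, M)
    return lambda X: np.sqrt(2.0 / M) * np.cos(X @ W + b)

def krr_rf_local(X, y, phi, lam):
    # w = (Phi_{M,D} Phi_{M,D}^T + lam I)^{-1} Phi_{M,D} y_bar: an M x M solve.
    Z = phi(X)
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(M), Z.T @ y / n)

def dkrr_rf(X, y, phi, lam, m):
    # f^0_{M,D,lam}: weighted average of the m local solutions.
    n = len(X)
    parts = np.array_split(np.arange(n), m)
    return sum(len(idx) / n * krr_rf_local(X[idx], y[idx], phi, lam)
               for idx in parts)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (2000, 2))
y = np.sin(X @ np.array([2.0, -1.0])) + 0.1 * rng.standard_normal(2000)
phi = make_rff(d=2, M=100, sigma=0.7)
w = dkrr_rf(X, y, phi, lam=1e-3, m=8)
pred = phi(X) @ w
```

Each machine solves only an M × M system on its own block, which is the source of the O(M²|D_j| + M³) per-machine cost discussed below.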

3. DKRR-RF WITH COMMUNICATIONS (DKRR-RF-CM)

In this section, we present an effective communication strategy to enlarge the number of local machines. We first give the motivation for our communication strategy, and then propose a communication-based method, called DKRR-RF-CM. The proposed communication strategy is an adaptation of (Lin et al., 2020) that avoids communicating local data among the partition nodes.

Motivation. Let g_{M,D,λ} : R^M → R^M be

g_{M,D,λ}(w) := (Φ_{M,D} Φ_{M,D}^T + λI) w − Φ_{M,D} ȳ_D.

One can see that 2 g_{M,D,λ}(w) is the gradient with respect to w of the empirical risk (1/|D|) Σ_{(x_i,y_i)∈D} (w^T φ_M(x_i) − y_i)² + λ‖w‖². From Eq. 2, we know that for any w the following equation holds:

w_{M,D,λ} = w − (Φ_{M,D} Φ_{M,D}^T + λI)^{−1} [(Φ_{M,D} Φ_{M,D}^T + λI) w − Φ_{M,D} ȳ_D] = w − (Φ_{M,D} Φ_{M,D}^T + λI)^{−1} g_{M,D,λ}(w).

Replacing the global inverse by a weighted average of the local inverses yields the iteration

w̄^t_{M,D,λ} = w̄^{t−1}_{M,D,λ} − Σ_{j=1}^m (|D_j|/|D|) (Φ_{M,D_j} Φ_{M,D_j}^T + λI)^{−1} g_{M,D,λ}(w̄^{t−1}_{M,D,λ}).   (6)

We propose an iterative procedure to implement the communication strategy in Eq. 6, which can be broken down into four steps. First, each local machine computes its local gradient and communicates it back to the global machine. Then the global machine computes the global gradient from the local gradients and communicates it to each local machine. In the third step, each local machine computes β^{t−1}_j = (Φ_{M,D_j} Φ_{M,D_j}^T + λI)^{−1} g_{M,D,λ}(w̄^{t−1}_{M,D,λ}) and communicates it back to the global machine. Finally, the global machine obtains the solution w̄^t_{M,D,λ}. More details can be seen in Algorithm 1.

Complexity analysis. Space complexity: each local machine only needs to store Φ_{M,D_j} and the local gradient g_{M,D_j,λ}, so the space complexity of each local machine is O(M|D_j| + M) = O(M|D_j|). Time complexity: each local machine computes the matrix product Φ_{M,D_j} Φ_{M,D_j}^T and the inverse of Φ_{M,D_j} Φ_{M,D_j}^T + λI only once; in each iteration, it computes the local gradient g_{M,D_j,λ} and β_j.
Therefore, the total time complexity of each local machine is O(M³ + M²|D_j| + pM|D_j|), where p is the number of communications. Communication complexity: in each iteration, each local machine only communicates the local gradient g_{M,D_j,λ} and β_j to the global machine, and receives the gradient g_{M,D,λ} and w̄^{t−1}_{M,D,λ} from the global machine, so the total communication complexity is O(pM). Remark 1. From the complexity analysis above, we can see that if the number of communications p satisfies p ≤ M or p ≤ |D_j|, then the time and space complexity of DKRR-RF-CM are the same as those of DKRR-RF. Only the communication complexity is slightly increased, from O(M) to O(pM).
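The four steps above can be simulated on a single machine as follows (a hedged sketch, not the authors' Algorithm 1: the random feature map, data, and hyperparameters are illustrative, and the m "machines" are just index blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M, m, lam, p = 4096, 2, 64, 4, 1e-2, 8

# Illustrative random Fourier feature map phi_M (Gaussian kernel stand-in).
W = rng.standard_normal((d, M)); b = rng.uniform(0, 2 * np.pi, M)
phi = lambda X: np.sqrt(2.0 / M) * np.cos(X @ W + b)

X = rng.uniform(-1, 1, (n, d))
y = np.sin(X @ np.array([2.0, -1.0])) + 0.1 * rng.standard_normal(n)

parts = np.array_split(np.arange(n), m)
Zs = [phi(X[idx]) for idx in parts]
# Local operators A_j = Phi_j Phi_j^T + lam I, each inverted once per machine.
As = [Z.T @ Z / len(Z) + lam * np.eye(M) for Z in Zs]
A_invs = [np.linalg.inv(A) for A in As]
cs = [Z.T @ y[idx] / len(Z) for Z, idx in zip(Zs, parts)]
wts = [len(idx) / n for idx in parts]

# t = 0: plain DKRR-RF (weighted average of the local solutions).
w = sum(wt * Ai @ c for wt, Ai, c in zip(wts, A_invs, cs))
w0 = w.copy()

for _ in range(p):
    # Steps 1-2: local gradients g_j(w) = A_j w - c_j; their weighted average
    # equals the global gradient g(w) = (Phi Phi^T + lam I) w - Phi y_bar.
    g = sum(wt * (A @ w - c) for wt, A, c in zip(wts, As, cs))
    # Steps 3-4: local directions beta_j = A_j^{-1} g(w), averaged globally.
    w = w - sum(wt * Ai @ g for wt, Ai in zip(wts, A_invs))

# Centralized KRR-RF solution for reference.
Z = phi(X)
w_star = np.linalg.solve(Z.T @ Z / n + lam * np.eye(M), Z.T @ y / n)
```

With p = 0 this reduces to plain DKRR-RF; each round of Eq. 6 contracts the error toward the centralized solution w_star as long as the averaged local inverses approximate the global one, and only M-dimensional vectors cross machine boundaries.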

4. THEORETICAL ANALYSIS

In this section, we analyze the generalization performance of DKRR-RF and DKRR-RF-CM. The performance of an algorithm is usually measured by the expected risk E(f) = ∫_{X×Y} (f(x) − y)² dρ(x, y). The optimal hypothesis in H_K is denoted by f_{H_K} = arg min_{f∈H_K} E(f), and we assume throughout the paper that f_{H_K} exists.

Theorem 1. Suppose ψ is continuous, |ψ(x, ω)| ≤ τ almost surely with τ ∈ [1, ∞), and |y| ≤ ζ. If λ = Ω(|D|^{−1/2}), |D_1| = … = |D_m|, m ≲ |D|^{1/2} and M ≳ |D|^{1/2}, then, for every δ ∈ (0, 1], with probability at least 1 − δ, we have E[E(f̄^0_{M,D,λ})] − E(f_{H_K}) = O(|D|^{−1/2} log²(1/δ)).

From Theorem 1, one can see that if m ≲ √|D| and M ≳ √|D|, the learning rate of the generalization bound can reach O(1/√|D|), which is optimal in a minimax sense (Rudi & Rosasco, 2017; Caponnetto & Vito, 2007). It means that, in this basic setting, as long as the number of partitions and the number of random features are of order Ω(√|D|), the corresponding ridge regression estimator has optimal generalization properties. The assumption |y| ≤ ζ can be related to a Bernstein condition or a moment assumption (Blanchard & Krämer, 2016), but for simplicity, in this paper, we only consider bounded Y. Optimal learning rates for divide-and-conquer KRR in expectation have been established in (Zhang et al., 2013; 2015; Lin et al., 2017), etc. However, there is a strict restriction on the number of local machines m. Specifically, in (Lin et al., 2017), to reach the optimal rate, m has to be restricted to a constant m = Ω(1). In (Li et al., 2019b), the authors studied the generalization performance of the combination of the divide-and-conquer technique and random features. Using the same setting as Theorem 1 (that is, r = 1/2 and γ = 1 in Theorem 8 of Li et al. (2019b)), they prove that, if M ≳ √|D| and m = Ω(1), then E[E(f̄^0_{M,D,λ})] − E(f_{H_K}) = O(|D|^{−1/2} log²(1/δ)). That is, to guarantee the optimal generalization properties, the number of partitions must be restricted to a constant, whereas our result allows Ω(√|D|). In (Li et al., 2019b), they also considered using unlabeled data to enlarge the number of partitions. They proved (see Corollary 12 of Li et al. (2019b) for details) that, if M ≳ √|D| and m ≲ |D*|/|D|, then E[E(f̄^0_{M,D,λ})] − E(f_{H_K}) = O(|D|^{−1/2} log²(1/δ)), where D* is the dataset including both labeled and unlabeled data. Thus, if we want the number of partitions m to reach Ω(√|D|) as in our Theorem 1, the size of D* should be Ω(√|D|) times |D|. In this case, the data size of each local machine is |D| = |D|^{3/2}/√|D|, so the time and space complexity are the same as KRR with a single random feature technique. Note that E[E(f̄^0_{M,D,λ})] − E(f_{H_K}) = E[‖f̄^0_{M,D,λ} − f_{H_K}‖²_ρ] (Caponnetto & Vito, 2007); thus Theorem 1 gives the optimal learning rate in expectation, which reflects the average performance over multiple trials but may fail to capture the learning performance of a single trial. To essentially reflect the generalization performance of a single trial, we derive the optimal learning rate in probability:

4.2. OPTIMAL LEARNING RATES FOR DKRR IN PROBABILITY

Theorem 2. Under the same assumptions as Theorem 1, if λ = Ω(|D|^{−1/2}), |D_1| = … = |D_m|, m ≲ |D|^{1/4} and M ≳ |D|^{1/2}, then, for every δ ∈ (0, 1], with probability at least 1 − δ, we have ‖f̄^0_{M,D,λ} − f_{H_K}‖²_ρ = O(|D|^{−1/2} log²(1/δ)).

To guarantee the optimal generalization properties in probability, the number of partitions must be restricted to Ω(|D|^{1/4}), which is stricter than the Ω(|D|^{1/2}) of Theorem 1. This is because the generalization error in expectation can be decomposed into an approximation error, a sample error and a distributed error (see Proposition 1 in the Appendix for details), while the error decomposition in probability does not easily separate out a distributed error with which to control the number of local machines. To derive the optimal learning rate, we provide a novel decomposition; see Proposition 9 in the Appendix for details. The following result demonstrates that the proposed communication strategy can enlarge the number of partitions in probability.

Theorem 3. Under the same assumptions as Theorem 1, if λ = Ω(|D|^{−1/2}), |D_1| = … = |D_m|, m ≲ |D|^{(p+1)/(2(p+2))} and M ≳ |D|^{1/2}, then, for every δ ∈ (0, 1], with probability at least 1 − δ, we have


‖f̄^p_{M,D,λ} − f_{H_K}‖²_ρ = O(|D|^{−1/2} log^{p+2}(1/δ)),

where f̄^p_{M,D,λ} is returned by Algorithm 1 after p iterations. Comparing Theorem 3 with Theorem 2, it is clear that the proposed communication strategy relaxes the restriction on m from Ω(|D|^{1/4}) to Ω(|D|^{(p+1)/(2(p+2))}). Note that this bound on m increases monotonically with the number of communications p, which demonstrates the power of the proposed communications. As p → ∞, the number of partitions can reach Ω(√|D|), matching the generalization bound in expectation.

Remark 2. In the main text of this paper, we only give the optimal rates of DKRR-RF in the basic setting. Faster learning rates can be achieved under favorable conditions; see the Appendix.

Remark 3 (The Significance of Distributed Learning for RF). At first glance, it seems that the bottleneck in learning with random Fourier features is not the size of the dataset but the number of features. However, from (Rudi & Rosasco, 2017), one can see that only O(√|D|) random features are required to guarantee optimal performance, so the total computational complexity is O(M³ + M²|D|) = O(|D|²). Thus, the computational bottleneck is not only the number of random features but also the size of the dataset. If we do not reduce the size of D, the computational complexity is |D|² in the basic setting, which is not suitable for large scale problems. Distributed learning is one of the most popular methods to reduce the effective size of the dataset. Distributed learning introduces a distributed error, but decreases the variance of the model (Zhang et al., 2013; 2015). Thus, how to choose an appropriate number of partitions to trade off the distributed error and the variance, so as to guarantee optimal performance, is a very interesting and significant direction.

Table 1: Statistical and computational properties of the related approaches and our theoretical findings under the basic setting (m: number of partitions; M: number of random features; p: number of communications; d: input dimension).

DKRR-CM (Lin et al., 2020): in probability; memory |D|^{(p+3)/(p+2)}, time |D|^{3(p+3)/(2(p+2))}, communication pd|D|
DKRR-RF (Li et al., 2019b): m = Ω(1), M = |D|^{0.5}; in expectation; memory |D|, time |D|^{2}, communication |D|^{0.5}
DKRR-RF (Theorem 1): m = |D|^{0.5}, M = |D|^{0.5}; in expectation; memory |D|, time |D|^{1.5}, communication |D|^{0.5}
DKRR-RF (Theorem 2): m = |D|^{0.25}, M = |D|^{0.5}; in probability; memory |D|^{1.25}, time |D|^{1.75}, communication |D|^{0.5}
DKRR-RF-CM (Theorem 3): m = |D|^{(p+1)/(2(p+2))}, M = |D|^{0.5}; in probability; memory |D|^{(2p+5)/(2p+4)}, time |D|^{(3p+7)/(2p+4)}, communication p|D|^{0.5}

Comparisons of the Time and Space Complexities

Table 1 reports the statistical and computational properties of the related approaches and our theoretical findings under the basic setting. We see that our DKRR-RF guarantees the optimal generalization performance in expectation requiring only |D| memory and |D|^{1.5} time, which is more effective than the other methods. For DKRR-RF-CM, we can also see that it guarantees optimal generalization performance in probability with lower complexity than the communication-based method DKRR-CM (Lin et al., 2020). Remark 4. In (Rudi et al., 2017), the authors considered combining the Nyström method and preconditioned conjugate gradient (PCG) (Cutajar et al., 2016) to scale up KRR. As far as we know, it is the only existing work that guarantees optimal statistical accuracy while requiring only |D| memory and |D|^{1.5} time for KRR. In this paper, we consider combining distributed learning and random features, a completely different path from (Rudi et al., 2017). Note that our proposed method needs to compute the inverse of Φ_{M,D_j} Φ_{M,D_j}^T + λI, which requires |D|^{1.5} time. Inspired by (Rudi et al., 2017), we could also adopt PCG to avoid the inverse computation, which would further speed up our proposed DKRR-RF-CM. The combination of DKRR-RF-CM and PCG may open a path to linear time complexity with an optimal learning rate.
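As a sketch of how the inverse could be avoided (our illustration, not the method of (Rudi et al., 2017), and without the preconditioning they use), plain conjugate gradient solves a system of the form (Φ_{M,D_j} Φ_{M,D_j}^T + λI) w = b using only matrix-vector products; the feature matrix below is a random stand-in:

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=500):
    # Plain conjugate gradient for symmetric positive definite A w = b,
    # using only matrix-vector products (no explicit inverse of A).
    w = np.zeros_like(b)
    r = b - matvec(w)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

rng = np.random.default_rng(0)
n, M, lam = 2000, 100, 1e-3
Z = rng.standard_normal((n, M)) / np.sqrt(M)   # stand-in for phi_M features
y = rng.standard_normal(n)
b = Z.T @ y / n
# matvec for A = Z^T Z / n + lam I, applied in O(nM) without forming Z^T Z.
matvec = lambda v: Z.T @ (Z @ v) / n + lam * v
w = conjugate_gradient(matvec, b)
```

Each iteration costs O(M|D_j|), so a modest number of iterations replaces the O(M³) inverse.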

Novelty and Proof Techniques

The works most closely related to ours are (Li et al., 2019b), (Lin et al., 2020) and (Rudi & Rosasco, 2017). We discuss the novel techniques adopted to derive the improved results compared with them.

Compared with (Li et al., 2019b). (a) To derive the learning bounds, Q_{M,D} := (C_M + λI)^{−1/2}(C_M − C_{M,D})(C_M + λI)^{−1/2} needs to be estimated, where C_M and C_{M,D} are self-adjoint positive operators defined in Definition 1 (see the Appendix). In (Li et al., 2019b), they used a classical approach from (Chang et al., 2017b; Guo et al., 2017) to estimate Q_{M,D} (see Lemmas 21 and 22 in (Li et al., 2019b)), and obtained ‖Q_{M,D}‖ ≤ (1/√λ)‖(C_M − C_{M,D})(C_M + λI)^{−1/2}‖ = O(1/(λ|D|) + √(N(λ)/(λ|D|))), where N(λ) is the effective dimension defined in Assumption 1 (see the Appendix). In our paper, however, we directly estimate Q_{M,D} based on the concentration inequality for self-adjoint operators (Rudi & Rosasco, 2017; Lin & Cevher, 2020; 2018; Caponnetto & Yao, 2006), and prove that ‖Q_{M,D}‖ = O(1/(λ|D|) + √(1/(λ|D|))) (see Proposition 6). Thus, our estimate of ‖Q_{M,D}‖ is a factor √N(λ) tighter than that in (Li et al., 2019b). This is one of the key reasons why we can substantially relax the restriction on the number of local machines compared with (Li et al., 2019b). (b) We present bounds not only in expectation but also in probability. To derive the tight bound in probability, we provide new decompositions of f̄^0_{M,D,λ} − f_{M,D,λ} and w̄^0_{M,D,λ} − w_{M,D,λ}; see Proposition 9 for details. As far as we know, these decompositions are novel. (c) We also consider the Newton-Raphson iteration-based communication strategy. To derive the improved high-probability bounds with communication, we introduce a novel decomposition of ‖f̄^t_{M,D,λ} − f_{M,D,λ}‖_ρ (see Proposition 10).

Compared with (Lin et al., 2020). (a) In (Lin et al., 2020), they also considered a communication strategy to enlarge the number of local machines.
At first sight it seems that their strategy only needs to communicate gradient information, but it should be noted that this gradient information is based on an operator representation (see Eq. (7) in (Lin et al., 2020)), which is usually infeasible in practice. The authors present a realization of the proposed strategy that communicates the data among the local machines; see Appendix B of (Lin et al., 2020) (page 34, step 1). Thus, the data privacy of each local machine is difficult to protect. Furthermore, since it requires communicating the data D_j, j = 1, …, m, among the local machines, the communication complexity of each local machine is O(|D|d) per iteration, which is too high for large scale data sets. In contrast, the communication strategy proposed in this paper only requires communicating the gradient g_{M,D_j,λ}(w̄^{t−1}_{M,D,λ}) and the model parameters β^{t−1}_j, rather than the data, so our proposed strategy does better at privacy protection. Moreover, the communication complexity is only O(M) for each local machine, with M ≪ |D|, which is suitable for large scale data sets. (b) At first sight it seems that the proof techniques of (Lin et al., 2020) can easily be extended to our paper, but this is not true. If we used the same proof techniques as (Lin et al., 2020), we could only obtain ‖f_{M,D,λ} − f_{M,λ}‖_ρ = ‖(L_{M,D} + λI)^{−1}(L_{M,D} − L_M)(f_ρ − f_{M,λ})‖_ρ = O((1/(√λ|D|) + √(N(λ)/|D|)) ‖f_ρ − f_{M,λ}‖_ρ/√λ), where f_{M,D,λ}, f_{M,λ} and f_ρ are defined in Definition 12. Combining with Proposition 2, one could then only obtain ‖f_{M,D,λ} − f_{M,λ}‖_ρ = O((1/(λ|D|) + √(N(λ)/(λ|D|))) ‖f_ρ − f_{M,λ}‖_ρ). In our paper, however, we introduce new decompositions of ‖f_{M,D,λ} − f_{M,λ}‖_ρ and ‖f_{M,D,λ} − f̃_{M,D,λ}‖_ρ (see Propositions 2 and 3 for details), and further obtain ‖f_{M,D,λ} − f_{M,λ}‖_ρ = O((1/(√λ|D|) + √(N(λ)/|D|)) ‖f_ρ − f_{M,λ}‖_ρ) (see Proposition 4), which is a factor 1/√λ tighter than directly using the techniques of (Lin et al., 2020).
The novel decompositions of ‖f_{M,D,λ} − f_{M,λ}‖_ρ and ‖f_{M,D,λ} − f̃_{M,D,λ}‖_ρ are the key reasons why we can guarantee the optimal performance even with m = Ω(√|D|). We give only one example here, but the novel decompositions are also embedded in Propositions 5, 9, 10, etc.

Compared with (Rudi & Rosasco, 2017). (a) We study the statistical properties of the combination of distributed learning and random features, whereas (Rudi & Rosasco, 2017) only considers random features. As mentioned above, to obtain a tight bound on the distributed error, we introduce a novel decomposition of f̄^0_{M,D,λ} − f_{M,D,λ} to derive the tight bound in expectation, and novel decompositions of f̄^0_{M,D,λ} − f_{M,D,λ} and f̄^p_{M,D,λ} − f_{M,D,λ} to derive tight bounds in probability. (b) The combination not only introduces a distributed error but also brings some other difficulties. Comparing the proofs of (Rudi & Rosasco, 2017) with those of our paper in detail, one can find that the decompositions of (Rudi & Rosasco, 2017) and ours are very different. Overall, we improve the existing state-of-the-art bounds in expectation, and provide novel communication-based distributed bounds with random features in probability. Moreover, we introduce some novel techniques and decompositions to substantially relax the restriction on the number of local machines, which are non-trivial extensions of (Li et al., 2019b; Lin et al., 2020; Rudi & Rosasco, 2017).

5. EXPERIMENTS

In this section, we validate our theoretical findings by performing experiments on both simulated and real datasets.

Numerical experiments. Inspired by the numerical experiments in (Rudi & Rosasco, 2017; Li et al., 2019e), we consider a spline kernel of order q: K_{2q}(x, x′) = 1 + Σ_{k=1}^∞ cos(2πk(x − x′))/k^{2q}. If the marginal distribution of X is uniform on [0, 1], then K_{2q}(x, x′) = ∫_0^1 ψ(x, ω)ψ(x′, ω)p(ω)dω, where ψ(x, ω) = K_q(x, ω) and p(ω) is also uniform on [0, 1]. The random features of the spline kernel are φ_M(x) = (ψ(x, ω_1), …, ψ(x, ω_M))^T/√M. According to Theorems 1, 2 and 3, we set the number of random features to M = √|D|, and fine-tune λ around |D|^{−1/2} by 5-fold cross-validation over the set {2^{−5}, 2^{−3}, …, 2^5}|D|^{−1/2}. We let the target function f* be a Gaussian random variable with mean µ = K_t(x, 0) and variance σ² = 0.01. We generate 10000 samples for training and 10000 samples for testing. We use exact KRR, trained on all samples in a single batch, as a baseline, and compare our proposed DKRR-RF-CM (p = 2, 4, 8) with KRR and DKRR-RF. We repeat the training 5 times and report the averaged error on the testing data. The mean squared error on the test set for different numbers of partitions is given in Figure 1(a, b), and can be summarized as follows: 1) When m is not too large, the distributed methods (DKRR-RF and DKRR-RF-CM) are always comparable to the original KRR. There exists an upper bound on m beyond which the error increases dramatically and is far from that of the original KRR; this verifies the theoretical statements of Theorems 1, 2 and 3. 2) The admissible m for DKRR-RF-CM is much larger than for DKRR-RF. This result is aligned with Theorem 3, which shows that the proposed communication strategy enlarges the upper bound on m. 3) The admissible m for DKRR-RF-CM increases monotonically with the number of communications, which also verifies Theorem 3. Real Data.
In this experiment, we consider the performance on real data. We use 6 publicly available datasets from LIBSVM Data. The empirical evaluations with the Gaussian kernel, exp(−‖x − x′‖²/σ), are given in Figure 2, where the optimal σ and λ are selected by 5-fold cross-validation over σ ∈ {2^i, i = −10, −8, …, 10} and λ ∈ {2^{−5}, 2^{−3}, …, 2^5}|D|^{−1/2}, and the number of random features is 2√|D|. From Figure 2, one can find that: 1) our DKRR-RF-CM is better than the original DKRR-RF on all datasets; 2) the more communication iterations, the better the performance. These results demonstrate that our communication-based DKRR-RF is effective.
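The spline random feature construction used in the simulated experiments can be checked numerically. The sketch below (our illustration, cosine variant, truncated at kmax terms) verifies that the Monte-Carlo feature inner product (1/M) Σ_i ψ(x, ω_i)ψ(x′, ω_i) approaches the integral ∫_0^1 ψ(x, ω)ψ(x′, ω)dω that defines the induced kernel:

```python
import numpy as np

def k_spline(a, b, q, kmax=200):
    # Truncated order-q spline kernel (cosine variant):
    # K_q(a, b) = 1 + sum_{k=1}^{kmax} cos(2 pi k (a - b)) / k^q.
    k = np.arange(1, kmax + 1)
    diff = np.subtract.outer(a, b)                       # pairwise a_i - b_j
    return 1.0 + np.cos(2 * np.pi * diff[..., None] * k) @ (1.0 / k ** q)

q = 2
x, xp = np.array([0.15]), np.array([0.4])
rng = np.random.default_rng(0)

# Monte-Carlo estimate with M random features omega_i ~ U[0, 1]:
# psi(x, omega) = K_q(x, omega), K_M(x, x') = <phi_M(x), phi_M(x')>.
M = 20000
omega = rng.uniform(0, 1, M)
K_mc = (k_spline(x, omega, q)[0] @ k_spline(xp, omega, q)[0]) / M

# Reference value: midpoint-rule quadrature of the defining integral.
grid = (np.arange(20000) + 0.5) / 20000
K_ref = np.mean(k_spline(x, grid, q)[0] * k_spline(xp, grid, q)[0])
```

The Monte-Carlo error decays as O(1/√M), consistent with the random feature approximation the experiments rely on.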

6. CONCLUSION

In this paper, we study the generalization properties of the combination of distributed learning and random features for ridge regression. We first improve the existing results for divide-and-conquer KRR with random features in expectation. Then, beyond the bounds in expectation, we derive generalization error bounds in probability. Finally, we propose a novel and effective communication strategy to further improve the learning performance of the combined method, and demonstrate the power of communications via both theoretical assessments and numerical experiments. Our results may open several avenues for both theoretical and empirical work: (a) combining the approach with gradient algorithms such as preconditioned conjugate gradient (Avron et al., 2017a) and multi-pass SGD (Carratino et al., 2018; Lin & Cevher, 2018; 2020); (b) replacing synchronous distributed methods with asynchronous ones (Suresh et al., 2017); (c) considering loss functions other than the quadratic loss (Li et al., 2019e).

A APPENDIX: FAST LEARNING RATES

In this section, we show that fast learning rates can be achieved under favorable conditions. Let L²_{ρ_X} = {f : X → R : ∫_X f²(x)dρ_X < ∞} be the space of square integrable functions, with norm ‖·‖_ρ. Denote by L_K the integral operator (Smale & Zhou, 2007) L_K f = ∫_X K(x, ·)f(x)dρ_X, ∀f ∈ L²_{ρ_X}.

Assumption 1. For λ > 0, the effective dimension of the integral operator L_K is defined as N(λ) = Tr((L_K + λI)^{−1}L_K), where Tr denotes the trace. Assume there exists a constant c ≥ 1 such that

N(λ) ≤ cλ^{−γ}, γ ∈ [0, 1].   (7)

The effective dimension is a common assumption within the framework of learning theory (Caponnetto & Vito, 2007; Smale & Zhou, 2005; Rudi & Rosasco, 2017), used to measure the complexity of the hypothesis space. It is always satisfied with γ = 1 and c = κ. Equation 7 controls the variance of the estimator and is equivalent to the classic entropy and covering number conditions. In particular, it holds if the eigenvalues of the integral operator L_K decay as i^{−1/γ}, which is satisfied by the popular Gaussian and polynomial kernels. More details can be found in (Steinwart & Christmann, 2008; Caponnetto & Vito, 2007; Rudi & Rosasco, 2017).

Assumption 2. Let f_ρ(x) = ∫_Y y dρ(y|x) be the regression function, where ρ(y|x) is the conditional distribution at x induced by ρ. For 1/2 ≤ r ≤ 1, assume there exists g ∈ L²_{ρ_X} such that

f_ρ(x) = L_K^r g(x),   (8)

where L_K^r is the r-th power of L_K. The regression function f_ρ is the best function in L²_{ρ_X} and is the primary objective in a regression problem. Assumption 2 measures the complexity of the regression function f_ρ and is commonly used in approximation theory (Caponnetto & Vito, 2007). Equation 8 controls the bias of the estimator; it requires the expansion of the regression function f_ρ to have coefficients that decay faster than the eigenvalues of the integral operator L_K. The larger the value of r, the faster the coefficients decay.
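Assumption 1's polynomial bound can be illustrated numerically: for eigenvalues decaying as i^{−1/γ}, the quantity N(λ) = Σ_i s_i/(s_i + λ) grows like λ^{−γ} as λ → 0 (a self-contained illustration with synthetic eigenvalues, not tied to any particular kernel):

```python
import numpy as np

def effective_dimension(eigs, lam):
    # N(lambda) = Tr((L_K + lam I)^{-1} L_K) = sum_i s_i / (s_i + lambda)
    # for an operator with eigenvalues s_i.
    return np.sum(eigs / (eigs + lam))

gamma = 0.5
eigs = np.arange(1, 10**6 + 1, dtype=float) ** (-1.0 / gamma)   # s_i = i^{-2}
ratios = [effective_dimension(eigs, lam) * lam ** gamma
          for lam in (1e-2, 1e-3, 1e-4)]
# ratios stay bounded as lam -> 0, i.e. N(lambda) = O(lambda^{-gamma})
```

Here the ratio N(λ)·λ^γ remains an O(1) constant across three orders of magnitude of λ, matching the bound N(λ) ≤ cλ^{−γ}.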
The case r = 1/2 means that f_ρ ∈ H_K. More details can be found in (Rudi & Rosasco, 2017; Smale & Zhou, 2007).

Theorem 4. Suppose ψ is continuous, |ψ(x, ω)| ≤ τ almost surely with τ ∈ [1, ∞), and |y| ≤ ζ. Under Assumptions 1-2 with r ∈ [1/2, 1], γ ∈ [0, 1], if λ = Ω(|D|^{−1/(2r+γ)}), |D_1| = … = |D_m|, m ≲ |D|^{(2r+γ−1)/(2r+γ)} and M ≳ |D|^{(1+γ(2r−1))/(2r+γ)}, then, for every δ ∈ (0, 1], with probability at least 1 − δ, we have E[‖f̄^0_{M,D,λ} − f_ρ‖²_ρ] = O(|D|^{−2r/(2r+γ)} log²(1/δ)).

The bound above is the same as for the original KRR estimator and is optimal in a minimax sense (Caponnetto & Vito, 2007; Lin et al., 2017; Chang et al., 2017b). In the best case, r = 1 and γ = 0, the rate O(1/|D|) is achieved with Ω(√|D|) random features and Ω(√|D|) partitions. The worst case, r = 1/2 and γ = 1, is covered by Theorem 1.

Theorem 5. Suppose ψ is continuous, |ψ(x, ω)| ≤ τ almost surely with τ ∈ [1, ∞), and |y| ≤ ζ. Under Assumptions 1-2 with r ∈ [1/2, 1], γ ∈ [0, 1], if λ = Ω(|D|^{−1/(2r+γ)}), |D_1| = … = |D_m|, m ≲ |D|^{(2r+γ−1)/(4r+2γ)} and M ≳ |D|^{(1+γ(2r−1))/(2r+γ)}, then, for every δ ∈ (0, 1], with probability at least 1 − δ, we have ‖f̄^0_{M,D,λ} − f_ρ‖²_ρ = O(|D|^{−2r/(2r+γ)} log²(1/δ)).

One can see that the upper bound on m is |D|^{(2r+γ−1)/(4r+2γ)}, which is stricter than that of Theorem 4. In the best case, r = 1 and γ = 0, the rate O(1/|D|) is achieved with Ω(√|D|) random features and Ω(|D|^{1/4}) partitions. The basic setting, r = 1/2 and γ = 1, is given in Theorem 2.

Theorem 6. Suppose ψ is continuous, |ψ(x, ω)| ≤ τ almost surely with τ ∈ [1, ∞), and |y| ≤ ζ. Under Assumptions 1-2 with r ∈ [1/2, 1], γ ∈ [0, 1], if λ = Ω(|D|^{−1/(2r+γ)}), |D_1| = … = |D_m|, m ≲ |D|^{((2r+γ−1)(p+1))/((2r+γ)(p+2))} and M ≳ |D|^{(1+γ(2r−1))/(2r+γ)} log(1/δ), then, for any δ ∈ (0, 1], with probability at least 1 − δ, ‖f̄^p_{M,D,λ} − f_ρ‖²_ρ = O(|D|^{−2r/(2r+γ)} log^{p+2}(1/δ)).

One can see that communication relaxes the restriction on the number of partitions.
As p → ∞, the number of partitions can reach Ω(|D|^{(2r+γ−1)/(2r+γ)}), which matches Theorem 4. Remark 5. In this paper, we focus on enlarging the number of local machines. In (Rudi & Rosasco, 2017; Rudi et al., 2018), it is proved that when random features are generated in a data-dependent manner, fewer random features are required to obtain optimal learning rates. Thus, if we adopt a data-dependent manner of generating random features, we can further improve the performance of DKRR-RF-CM with fewer random features.

B APPENDIX: NOTATION AND PRELIMINARIES

In this paper we denote the operator norm by $\|\cdot\|$ and the norm of $L^2_{\rho_X}$ by $\|\cdot\|_\rho$.

Definition 1.
$$S_M : \mathbb{R}^M \to L^2_{\rho_X}, \quad (S_M w)(x) = \langle w, \phi_M(x)\rangle;$$
$$S_M^* : L^2_{\rho_X} \to \mathbb{R}^M, \quad S_M^* g = \int_X \phi_M(x)\,g(x)\,d\rho_X(x);$$
$$S_{M,D}^* : L^2_{\rho_X} \to \mathbb{R}^M, \quad S_{M,D}^* g = \frac{1}{|D|}\sum_{x_j \in D} \phi_M(x_j)\,g(x_j);$$
$$C_M : \mathbb{R}^M \to \mathbb{R}^M, \quad C_M = \int_X \phi_M(x)\phi_M(x)^T\,d\rho_X(x);$$
$$C_{M,D} : \mathbb{R}^M \to \mathbb{R}^M, \quad C_{M,D} = \frac{1}{|D|}\sum_{x_j \in D} \phi_M(x_j)\phi_M(x_j)^T.$$

Lemma 1. $C_M$ and $C_{M,D}$ are self-adjoint and positive operators whose spectra are contained in $[0, \tau^2]$. Moreover, $C_M = S_M^* S_M$ and $C_{M,D} = \Phi_{M,D}\Phi_{M,D}^T = S_{M,D}^* S_M$.

Proof. That $C_M$ and $C_{M,D}$ are self-adjoint and positive with spectra in $[0,\tau^2]$, and that $C_M = S_M^* S_M$ and $C_{M,D} = \Phi_{M,D}\Phi_{M,D}^T$, follow directly from Caponnetto & Vito (2007); Smale & Zhou (2005; 2007); Rosasco et al. (2010); Rudi & Rosasco (2017); Lin et al. (2020). In the following, we prove that $C_{M,D} = S_{M,D}^* S_M$. From the definitions of $S_M$, $S_{M,D}^*$ and $C_{M,D}$ in Definition 1, we have, for every $\beta \in \mathbb{R}^M$, $(S_M\beta)(\cdot) = \langle \beta, \phi_M(\cdot)\rangle = \phi_M(\cdot)^T\beta$, and thus
$$S_{M,D}^* S_M \beta = \frac{1}{|D|}\sum_{x_j\in D} \phi_M(x_j)\phi_M(x_j)^T \beta = C_{M,D}\,\beta.$$
Hence $C_{M,D} = S_{M,D}^* S_M$ holds.

Definition 2.
$$\hat f_{M,D,\lambda} = \hat w_{M,D,\lambda}^T \phi_M(\cdot), \quad \hat w_{M,D,\lambda} = \arg\min_{w\in\mathbb{R}^M} \frac{1}{|D|}\sum_{z_i\in D}\big(w^T\phi_M(x_i) - y_i\big)^2 + \lambda\|w\|^2;$$
$$\tilde f_{M,D,\lambda} = \tilde w_{M,D,\lambda}^T \phi_M(\cdot), \quad \tilde w_{M,D,\lambda} = \arg\min_{w\in\mathbb{R}^M} \frac{1}{|D|}\sum_{z_i\in D}\big(w^T\phi_M(x_i) - f_\rho(x_i)\big)^2 + \lambda\|w\|^2;$$
$$f_{M,\lambda} = w_{M,\lambda}^T \phi_M(\cdot), \quad w_{M,\lambda} = \arg\min_{w\in\mathbb{R}^M} \int_X \big(w^T\phi_M(x) - f_\rho(x)\big)^2\,d\rho_X(x) + \lambda\|w\|^2;$$
$$f_\lambda = \arg\min_{f\in\mathcal{H}_K} \int_X \big(f(x) - f_\rho(x)\big)^2\,d\rho_X(x) + \lambda\|f\|_K^2.$$

Remark 6 (from Caponnetto & Vito (2007); Smale & Zhou (2007); Li et al. (2019b); Lin et al. (2020)).
$$\hat f_{M,D,\lambda} = S_M \hat w_{M,D,\lambda}, \quad \hat w_{M,D,\lambda} = (C_{M,D} + \lambda I)^{-1}\Phi_{M,D}\,\bar y_D;$$
$$\tilde f_{M,D,\lambda} = S_M \tilde w_{M,D,\lambda}, \quad \tilde w_{M,D,\lambda} = (C_{M,D} + \lambda I)^{-1} S_{M,D}^* f_\rho;$$
$$f_{M,\lambda} = S_M w_{M,\lambda}, \quad w_{M,\lambda} = (C_M + \lambda I)^{-1} S_M^* f_\rho.$$

Definition 3. The maximum random-feature dimension is denoted as
$$\mathcal{N}_\infty(\lambda) = \sup_{\omega\in\Omega}\big\|(L_K + \lambda I)^{-1/2}\psi(\cdot,\omega)\big\|_\rho^2, \quad \lambda > 0.$$

Remark 7 (from Rudi & Rosasco (2017)). $\mathcal{N}_\infty(\lambda) \le \tau^2\lambda^{-1}$ is always satisfied for every $\lambda > 0$.

C APPENDIX: PROOF OF THEOREM 4

C.1 APPENDIX: ERROR DECOMPOSITION FOR DKRR-RF IN EXPECTATION

Proposition 1. Let $\bar f^{\,0}_{M,D,\lambda}$, $\hat f_{M,D,\lambda}$, $\tilde f_{M,D,\lambda}$, $f_{M,\lambda}$ and $f_\lambda$ be defined by (3) and (12), respectively. Then we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 \le 3\big\|f_{M,\lambda} - f_\lambda\big\|_\rho^2 + 3\big\|f_\lambda - f_\rho\big\|_\rho^2 + 3\sum_{j=1}^m \frac{|D_j|^2}{|D|^2}\,\mathbb{E}\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2 + 3\sum_{j=1}^m \frac{|D_j|}{|D|}\,\mathbb{E}\big\|\tilde f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2.$$

Proof. Let $\bar f^{\,0}_{M,D,\lambda}$, $\hat f_{M,D,\lambda}$, $\tilde f_{M,D,\lambda}$ and $f_{M,\lambda}$ be defined by (3) and (12), respectively. According to Proposition 5 of (Chang et al., 2017b) or Lemma 20 of (Li et al., 2019b), we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho^2 \le \sum_{j=1}^m \frac{|D_j|^2}{|D|^2}\,\mathbb{E}\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2 + \sum_{j=1}^m \frac{|D_j|}{|D|}\,\mathbb{E}\big\|\tilde f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2. \tag{14}$$
Note that $(a+b+c)^2 \le 3a^2 + 3b^2 + 3c^2$ for all $a,b,c \ge 0$. Thus, we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = \mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_{M,\lambda} + f_{M,\lambda} - f_\lambda + f_\lambda - f_\rho\big\|_\rho^2 \le 3\,\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho^2 + 3\big\|f_{M,\lambda} - f_\lambda\big\|_\rho^2 + 3\big\|f_\lambda - f_\rho\big\|_\rho^2.$$
Combining the above inequality with Eq. (14) proves the proposition.

Proposition 2. The following hold:
$$\sqrt{\lambda}\,\big\|\hat w_{M,D,\lambda} - \tilde w_{M,D,\lambda}\big\| \le J_{M,D}\big(R_{M,D} + K_{M,D}\big) \quad\text{and}\quad \big\|\hat f_{M,D,\lambda} - \tilde f_{M,D,\lambda}\big\|_\rho \le J^2_{M,D}\big(R_{M,D} + K_{M,D}\big),$$
where
$$R_{M,D} := \big\|(C_M + \lambda I)^{-1/2}\big(\Phi_{M,D}\,\bar y_D - S_M^* f_\rho\big)\big\|, \quad J_{M,D} := \big\|(C_{M,D} + \lambda I)^{-1/2}(C_M + \lambda I)^{1/2}\big\|, \quad K_{M,D} := \big\|(C_M + \lambda I)^{-1/2}\big(S_M^* f_\rho - S_{M,D}^* f_\rho\big)\big\|.$$

Proof. From (13), we know that $\hat w_{M,D,\lambda} = (C_{M,D}+\lambda I)^{-1}\Phi_{M,D}\bar y_D$ and $\tilde w_{M,D,\lambda} = (C_{M,D}+\lambda I)^{-1} S_{M,D}^* f_\rho$, so we have
$$\hat w_{M,D,\lambda} - \tilde w_{M,D,\lambda} = (C_{M,D}+\lambda I)^{-1}\big(\Phi_{M,D}\bar y_D - S_{M,D}^* f_\rho\big) = (C_{M,D}+\lambda I)^{-1/2}\,(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\,(C_M+\lambda I)^{-1/2}\big(\Phi_{M,D}\bar y_D - S_{M,D}^* f_\rho\big). \tag{15}$$
Note that $(C_{M,D}+\lambda I)^{-1/2}$ is a self-adjoint and positive operator, so $\|(C_{M,D}+\lambda I)^{-1/2}\| \le 1/\sqrt{\lambda}$; thus we obtain
$$\big\|\hat w_{M,D,\lambda} - \tilde w_{M,D,\lambda}\big\| \le \frac{1}{\sqrt\lambda}\,J_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}\big(\Phi_{M,D}\bar y_D - S_{M,D}^* f_\rho\big)\big\| = \frac{1}{\sqrt\lambda}\,J_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}\big(\Phi_{M,D}\bar y_D - S_M^* f_\rho + S_M^* f_\rho - S_{M,D}^* f_\rho\big)\big\| \le \frac{1}{\sqrt\lambda}\,J_{M,D}\big(R_{M,D} + K_{M,D}\big). \tag{16}$$
Note that $\hat f_{M,D,\lambda} - \tilde f_{M,D,\lambda} = S_M(\hat w_{M,D,\lambda} - \tilde w_{M,D,\lambda})$; by (15), we have
$$\hat f_{M,D,\lambda} - \tilde f_{M,D,\lambda} = S_M(C_M+\lambda I)^{-1/2}\,(C_M+\lambda I)^{1/2}(C_{M,D}+\lambda I)^{-1/2}\,(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\,(C_M+\lambda I)^{-1/2}\big(\Phi_{M,D}\bar y_D - S_M^* f_\rho + S_M^* f_\rho - S_{M,D}^* f_\rho\big). \tag{17}$$
Note that
$$\big\|S_M(C_M+\lambda I)^{-1/2}\big\| = \big\|(C_M+\lambda I)^{-1/2} S_M^* S_M (C_M+\lambda I)^{-1/2}\big\|^{1/2} = \big\|(C_M+\lambda I)^{-1/2} C_M (C_M+\lambda I)^{-1/2}\big\|^{1/2} \le 1.$$
So, by Eq. (17), we have
$$\big\|\hat f_{M,D,\lambda} - \tilde f_{M,D,\lambda}\big\|_\rho \le J_{M,D}^2\big(R_{M,D} + K_{M,D}\big).$$

Proposition 3. The following hold:
$$\sqrt\lambda\,\big\|\tilde w_{M,D,\lambda} - w_{M,\lambda}\big\| \le \big\|f_{M,\lambda} - f_\rho\big\|_\rho + J_{M,D}\big\|f_{M,\lambda} - f_\rho\big\|_\rho \quad\text{and}\quad \big\|\tilde f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho \le \big(J_{M,D} + J^2_{M,D}\big)\big\|f_{M,\lambda} - f_\rho\big\|_\rho,$$
where $J_{M,D} := \big\|(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\big\|$.

Proof. By Remark 6, we have
$$\tilde w_{M,D,\lambda} - w_{M,\lambda} = (C_{M,D}+\lambda I)^{-1} S_{M,D}^* f_\rho - (C_M+\lambda I)^{-1} S_M^* f_\rho = (C_{M,D}+\lambda I)^{-1}\big[S_{M,D}^* f_\rho - S_M^* f_\rho\big] + \big[(C_{M,D}+\lambda I)^{-1} - (C_M+\lambda I)^{-1}\big] S_M^* f_\rho.$$

Note that for any self-adjoint and positive operators A and B,

$$A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}, \qquad A^{-1} - B^{-1} = B^{-1}(B - A)A^{-1}, \tag{18}$$
so we have
$$\tilde w_{M,D,\lambda} - w_{M,\lambda} = (C_{M,D}+\lambda I)^{-1}\big[S_{M,D}^* f_\rho - S_M^* f_\rho\big] + (C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D}\big)\,w_{M,\lambda}.$$
From Lemma 1, we know that $C_M = S_M^* S_M$ and $C_{M,D} = \Phi_{M,D}\Phi_{M,D}^T = S_{M,D}^* S_M$, thus we can obtain
$$\tilde w_{M,D,\lambda} - w_{M,\lambda} = (C_{M,D}+\lambda I)^{-1}\big[S_{M,D}^* f_\rho - S_M^* f_\rho\big] + (C_{M,D}+\lambda I)^{-1}\big(S_M^* S_M w_{M,\lambda} - S_{M,D}^* S_M w_{M,\lambda}\big) = (C_{M,D}+\lambda I)^{-1}\big[S_{M,D}^* f_\rho - S_{M,D}^* f_{M,\lambda}\big] + (C_{M,D}+\lambda I)^{-1}\big[S_M^* f_{M,\lambda} - S_M^* f_\rho\big] = (C_{M,D}+\lambda I)^{-1} S_{M,D}^*\big[f_\rho - f_{M,\lambda}\big] + (C_{M,D}+\lambda I)^{-1} S_M^*\big[f_{M,\lambda} - f_\rho\big]. \tag{19}$$

C.2 PROOF OF THEOREM 4

Lemma 2 (Lemma 23 in (Li et al., 2019b); see also (Rudi & Rosasco, 2017)). For $\delta\in(0,1]$ and $\lambda>0$, when
$$M \ge \Omega\Big(\Big(\frac{\mathcal{N}(\lambda)}{\lambda}\Big)^{2r-1}\Big(\mathcal{N}_\infty(\lambda)\log\tfrac1\lambda\Big)^{2-2r} \vee \mathcal{N}_\infty(\lambda)\log\tfrac{1}{\lambda\delta}\Big),$$
then, with probability at least $1-\delta$, we have $\|f_{M,\lambda} - f_\lambda\|_\rho^2 \le c\lambda^{2r}$, where $c$ is a constant.

Lemma 3 (Theorem 4 in (Smale & Zhou, 2005)). Under Assumption 2, for $r\in[1/2,1]$, we have $\|f_\lambda - f_\rho\|_\rho^2 \le c\lambda^{2r}$, where $c$ is a constant.

Lemma 4 (Lemma 6 in (Rudi & Rosasco, 2017)). For $\delta\in(0,1]$, with probability at least $1-\delta$, we have
$$R_{M,D} := \big\|(C_M+\lambda I)^{-1/2}\big(\Phi_{M,D}\bar y_D - S_M^* f_\rho\big)\big\| = O\Big(\Big(\frac{1}{\sqrt\lambda\,|D|} + \sqrt{\frac{\mathcal{N}_M(\lambda)}{|D|}}\Big)\log\frac1\delta\Big),$$
where $\mathcal{N}_M(\lambda) := \mathrm{Tr}\big((L_M+\lambda I)^{-1}L_M\big)$, and $L_M$ is the integral operator associated with the approximate kernel function $K_M$, $(L_M f)(x) = \int_X K_M(x,x')f(x')\,d\rho_X(x')$.

Lemma 5 (Proposition 10 in (Rudi & Rosasco, 2017)). For any $\delta\in(0,1]$, if $M \ge \Omega\big(\mathcal{N}_\infty(\lambda)\log\frac{1}{\lambda\delta}\big)$, then with probability at least $1-\delta$, $|\mathcal{N}_M(\lambda) - \mathcal{N}(\lambda)| \le 1.55\,\mathcal{N}(\lambda)$, where $\mathcal{N}_M(\lambda) := \mathrm{Tr}\big((L_M+\lambda I)^{-1}L_M\big)$.

Lemma 6 (Lemma E.2 in (Blanchard & Krämer, 2010)). For any self-adjoint and positive semidefinite operators $A$ and $B$, if there exists $\eta>0$ such that $\big\|(A+\lambda I)^{-1/2}(B-A)(A+\lambda I)^{-1/2}\big\| \le 1-\eta$, then $\big\|(A+\lambda I)^{1/2}(B+\lambda I)^{-1/2}\big\| \le \frac{1}{\sqrt\eta}$.

Proposition 5.
For $\delta \in (0,1]$, with probability at least $1-\delta$, we have
$$K_{M,D} := \big\|(C_M+\lambda I)^{-1/2}\big(S_M^* f_\rho - S_{M,D}^* f_\rho\big)\big\| \le \frac{2\tau\zeta\log\frac1\delta}{3|D|\sqrt\lambda} + 2\zeta\sqrt{\frac{\mathcal{N}_M(\lambda)}{|D|}},$$
where $\mathcal{N}_M(\lambda) := \mathrm{Tr}\big((L_M+\lambda I)^{-1}L_M\big)$. The proof closely follows that of Lemma 6 in (Rudi & Rosasco, 2017).

Proof. Define $\mu_i = (C_M+\lambda I)^{-1/2} S_M^* f_\rho - (C_M+\lambda I)^{-1/2}\phi_M(x_i)f_\rho(x_i)$, and note that
$$(C_M+\lambda I)^{-1/2}\big(S_M^* f_\rho - S_{M,D}^* f_\rho\big) = \frac{1}{|D|}\sum_{i=1}^{|D|}\mu_i.$$
The vectors $\mu_1,\ldots,\mu_{|D|}$ are independent and identically distributed, with
$$\mathbb{E}\mu_i = \int_X (C_M+\lambda I)^{-1/2}\phi_M(x)f_\rho(x)\,d\rho_X(x) - \int_X (C_M+\lambda I)^{-1/2}\phi_M(x_i)f_\rho(x_i)\,d\rho_X(x_i) = 0.$$
To apply the Bernstein inequality (Arcones, 1995; Rudi & Rosasco, 2017) for random vectors, we need to bound their moments. Note that $\big\|(C_M+\lambda I)^{-1/2}\phi_M(x)f_\rho(x)\big\| \le \frac{\tau\zeta}{\sqrt\lambda}$ and
$$\mathbb{E}\|\mu_i\|^2 \le 2\int_X \big\|(C_M+\lambda I)^{-1/2}\phi_M(x)\big\|^2 f_\rho(x)^2\,d\rho_X(x) \le 2\zeta^2\int_X \big\|(C_M+\lambda I)^{-1/2}\phi_M(x)\big\|^2\,d\rho_X(x) \le 2\zeta^2\,\mathcal{N}_M(\lambda).$$
Thus, using the Bernstein inequality (Arcones, 1995; Rudi & Rosasco, 2017), for any $\delta\in(0,1]$, with probability at least $1-\delta$, we have
$$\big\|(C_M+\lambda I)^{-1/2}\big(S_M^* f_\rho - S_{M,D}^* f_\rho\big)\big\| \le \frac{2\tau\zeta\log\frac1\delta}{3|D|\sqrt\lambda} + 2\zeta\sqrt{\frac{\mathcal{N}_M(\lambda)}{|D|}}.$$

Lemma 7 (Lemma 2 in (Smale & Zhou, 2007)). Let $\mathcal{H}$ be a Hilbert space and $\xi$ a random variable on $(Z,\rho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le \tilde M < \infty$ almost surely, and denote $\sigma^2(\xi) = \mathbb{E}\|\xi\|^2$. Let $\{z_i\}_{i=1}^n$ be independent random draws from $\rho$. For any $0 < \delta < 1$, with confidence $1-\delta$,
$$\Big\|\frac{1}{n}\sum_{i=1}^n\big[\xi_i - \mathbb{E}\xi\big]\Big\| \le \frac{2\tilde M\log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{n}}.$$

Proposition 6. For any $\delta > 0$, with probability at least $1-\delta$, we have
$$Q_{M,D} := \big\|(C_M+\lambda I)^{-1/2}\big(C_M - C_{M,D}\big)(C_M+\lambda I)^{-1/2}\big\| \le \frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{|D|} + \sqrt{\frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{|D|}}$$
and
$$L_{M,D} := \big\|(C_M+\lambda I)^{-1}\big(C_M - C_{M,D}\big)\big\| \le \frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{|D|} + \sqrt{\frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{|D|}},$$
where $\mathcal{N}_\infty(\lambda) = \sup_{\omega\in\Omega}\big\|(L_K+\lambda I)^{-1/2}\psi(\cdot,\omega)\big\|_\rho^2$. To prove Proposition 6, we first prove the following lemma; a similar technique can be found in (Hsu et al., 2012; Rudi et al., 2013; Caponnetto & Yao, 2006; Rudi & Rosasco, 2017).
A similar result for the matrix case was first proved in (Hsu et al., 2012), and it was later extended to the operator case in (Rudi et al., 2013; Rudi & Rosasco, 2017).

Lemma 8. Let $\zeta_1,\ldots,\zeta_n$, $n \ge 1$, be i.i.d. random vectors on a separable Hilbert space $\mathcal{H}$ such that $H = \mathbb{E}\,\zeta\otimes\zeta$ is trace class, and for any $\lambda > 0$ there exists $\mathcal{N}_\infty(\lambda) < \infty$ such that $\langle\zeta, (H+\lambda I)^{-1}\zeta\rangle \le \mathcal{N}_\infty(\lambda)$ almost surely. Denote $H_n = \frac1n\sum_{i=1}^n \zeta_i\otimes\zeta_i$. Then for any $\delta > 0$, with probability at least $1-2\delta$, the following holds:
$$\big\|(H+\lambda I)^{-1/2}\big(H - H_n\big)(H+\lambda I)^{-1/2}\big\| \le \frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{n} + \sqrt{\frac{2\log(2/\delta)\big(\mathcal{N}_\infty(\lambda)+1\big)}{n}}.$$

Proof. Let $H_\lambda = H+\lambda I$, $\eta = H_\lambda^{-1/2} H H_\lambda^{-1/2}$, and $\xi_i = \eta - H_\lambda^{-1/2}\zeta_i\otimes H_\lambda^{-1/2}\zeta_i$. One can see that $\mathbb{E}\xi_i = 0$. Note that
$$\big\|\eta - H_\lambda^{-1/2}\zeta_i\otimes H_\lambda^{-1/2}\zeta_i\big\| \le \|\eta\| + \big\langle\zeta_i, H_\lambda^{-1}\zeta_i\big\rangle \le 1 + \mathcal{N}_\infty(\lambda)$$
and
$$\mathbb{E}\big[\xi_i^2\big] = \mathbb{E}\Big[\big\langle\zeta_i, H_\lambda^{-1}\zeta_i\big\rangle\, H_\lambda^{-1/2}\zeta_i\otimes H_\lambda^{-1/2}\zeta_i\Big] - H_\lambda^{-2}H^2 \preceq \mathcal{N}_\infty(\lambda)\,\mathbb{E}\big[H_\lambda^{-1/2}\zeta_i\otimes H_\lambda^{-1/2}\zeta_i\big] + H_\lambda^{-2}H^2 \preceq \mathcal{N}_\infty(\lambda)\,H_\lambda^{-1}H + H_\lambda^{-2}H^2 \preceq \big(\mathcal{N}_\infty(\lambda)+1\big)I.$$
Substituting the above two bounds into Lemma 7 (Lemma 2 in (Smale & Zhou, 2007)) finishes the proof.

Proof of Proposition 6. Since $C_M$ is a self-adjoint operator, we have
$$\big\|(C_M+\lambda I)^{-1}\big(C_M - C_{M,D}\big)\big\| = \big\|(C_M+\lambda I)^{-1/2}\big(C_M - C_{M,D}\big)(C_M+\lambda I)^{-1/2}\big\|.$$
Applying Lemma 8 with $\zeta_i = \phi_M(x_i)$, we obtain the result.

Proposition 7. If $|D| \ge 32\log(2/\delta)\big(1+\mathcal{N}_\infty(\lambda)\big)$, then for any $\delta > 0$, with probability at least $1-\delta$, we have
$$J_{M,D} := \big\|(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\big\| \le \sqrt2.$$

Proof. From Proposition 6, we know that if $|D| \ge 32\log(2/\delta)\big(1+\mathcal{N}_\infty(\lambda)\big)$, then
$$\big\|(C_M+\lambda I)^{-1/2}\big(C_{M,D} - C_M\big)(C_M+\lambda I)^{-1/2}\big\| \le \frac12.$$
Combining the above inequality with Lemma 6, we can prove this result.

Proposition 8. If $\delta\in(0,1]$ and $|D| \ge \Omega\big(\mathcal{N}_\infty(\lambda)\big)$, then with probability at least $1-\delta$, we have
$$\big\|\hat f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho = O\Big(\Upsilon_{M,D,\lambda}\log\frac1\delta + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho\Big), \quad\text{where } \Upsilon_{M,D,\lambda} := \frac{1}{\sqrt\lambda\,|D|} + \sqrt{\frac{\mathcal{N}(\lambda)}{|D|}}.$$

Proof. From Proposition 4, we have
$$\big\|\hat f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho \le J^2_{M,D}\big(R_{M,D} + K_{M,D}\big) + \big(J_{M,D} + J^2_{M,D}\big)\big\|f_{M,\lambda} - f_\rho\big\|_\rho.$$
Thus, from Lemmas 4 and 5 and Propositions 5 and 7, we know that if $|D| \ge \Omega\big(\mathcal{N}_\infty(\lambda)\big)$, the result follows.

Proof of Theorem 4.
According to Proposition 1, we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 \le 3\big\|f_{M,\lambda} - f_\lambda\big\|_\rho^2 + 3\big\|f_\lambda - f_\rho\big\|_\rho^2 + 3\sum_{j=1}^m \frac{|D_j|^2}{|D|^2}\,\mathbb{E}\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2 + 3\sum_{j=1}^m \frac{|D_j|}{|D|}\,\mathbb{E}\big\|\tilde f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho^2.$$
Substituting Lemmas 2-3 and Propositions 3, 7 and 8 into the above inequality, one can see that if
$$M \ge \Omega\Big(\frac{\mathcal{N}^{2r-1}(\lambda)}{\lambda^{2r-1}}\,\mathcal{N}_\infty(\lambda)^{2-2r} \vee \mathcal{N}_\infty(\lambda)\Big)$$
and $|D_j| \ge \Omega\big(\mathcal{N}_\infty(\lambda)\big)$, then with confidence $1-\delta$, we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = O\Big(\lambda^{2r} + \sum_{j=1}^m\frac{|D_j|^2}{|D|^2}\,\Upsilon^2_{M,D_j,\lambda}\log^2\frac1\delta + \sum_{j=1}^m\frac{|D_j|}{|D|}\,\lambda^{2r}\Big), \tag{23}$$
where $\Upsilon_{M,D_j,\lambda} = \frac{1}{\sqrt\lambda\,|D_j|} + \sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|}}$. Setting $|D_1| = \cdots = |D_m|$ and $\lambda = \Omega\big(|D|^{-\frac{1}{2r+\gamma}}\big)$, we have
$$\Upsilon_{M,D_j,\lambda} = O\Big(\sqrt{\frac{\mathcal{N}_M(\lambda)}{|D_j|}} + \frac{1}{|D_j|\sqrt\lambda}\Big) = O\Big(\sqrt m\,|D|^{-\frac{r}{2r+\gamma}} + m\,|D|^{-\frac{4r+2\gamma-1}{4r+2\gamma}}\Big). \tag{24}$$
Note that if $m \le \Omega\big(|D|^{\frac{2r+\gamma-1}{2r+\gamma}}\big)$, $|D_1| = \cdots = |D_m|$ and $\lambda = \Omega\big(|D|^{-\frac{1}{2r+\gamma}}\big)$, we have
$$|D_j| = \frac{|D|}{m} \ge \Omega\big(|D|^{\frac{1}{2r+\gamma}}\big) = \Omega\big(\mathcal{N}_\infty(\lambda)\big).$$
Thus, substituting (24) into (23), one can see that if $m \le \Omega\big(|D|^{\frac{2r+\gamma-1}{2r+\gamma}}\big)$ and
$$M \ge \Omega\Big(\frac{\mathcal{N}^{2r-1}(\lambda)}{\lambda^{2r-1}}\,\mathcal{N}_\infty(\lambda)^{2-2r} \vee \mathcal{N}_\infty(\lambda)\Big) = \Omega\big(|D|^{\frac{1+(2r-1)\gamma}{2r+\gamma}}\big),$$
then with probability at least $1-\delta$, we have
$$\mathbb{E}\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = O\Big(|D|^{-\frac{2r}{2r+\gamma}}\log^2\frac1\delta + |D|^{-1}\log^2\frac1\delta + |D|^{-\frac{2r}{2r+\gamma}}\Big) = O\Big(|D|^{-\frac{2r}{2r+\gamma}}\log^2\frac1\delta\Big).$$

D APPENDIX: PROOF OF THEOREM 5

D.1 APPENDIX: ERROR DECOMPOSITION FOR DKRR-RF IN PROBABILITY

Proposition 9. The following hold:
$$\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\| \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D}\big(Q_{M,D} + Q_{M,D_j}\big)\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|$$
and
$$\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D}\big(Q_{M,D} + Q_{M,D_j}\big)\Big(\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|\Big),$$
where $J_{M,D} := \big\|(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\big\|$ and $Q_{M,D} := \big\|(C_M+\lambda I)^{-1/2}(C_M - C_{M,D})(C_M+\lambda I)^{-1/2}\big\|$.

Proof.
Note that $\hat w_{M,D,\lambda} = (C_{M,D}+\lambda I)^{-1}\Phi_{M,D}\bar y_D$; thus we have
$$\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\Big[(C_{M,D_j}+\lambda I)^{-1}\Phi_{M,D_j}\bar y_{D_j} - (C_{M,D}+\lambda I)^{-1}\Phi_{M,D}\bar y_D\Big] = \sum_{j=1}^m \frac{|D_j|}{|D|}\big[(C_{M,D_j}+\lambda I)^{-1} - (C_{M,D}+\lambda I)^{-1}\big]\Phi_{M,D_j}\bar y_{D_j}$$
$$= \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_{M,D_j}\big)(C_{M,D_j}+\lambda I)^{-1}\Phi_{M,D_j}\bar y_{D_j} = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_{M,D_j}\big)\hat w_{M,D_j,\lambda}$$
$$= \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\hat w_{M,D_j,\lambda} + \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\hat w_{M,D_j,\lambda}$$
$$= \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) + \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)w_{M,\lambda} + \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\hat w_{M,D_j,\lambda}$$
$$= \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) + \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big), \tag{25}$$
where the last equality uses $\sum_{j=1}^m \frac{|D_j|}{|D|}\,C_{M,D_j} = C_{M,D}$, so that the terms involving $w_{M,\lambda}$ cancel. Note that
$$\sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}(C_M+\lambda I)\,(C_M+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)$$
and
$$\sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D}+\lambda I)^{-1}(C_M+\lambda I)\,(C_M+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big).$$
Substituting the above equations into Eq. (25), we have
$$\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\| \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D}\big(L_{M,D} + L_{M,D_j}\big)\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|,$$
where $L_{M,D} := \big\|(C_M+\lambda I)^{-1}(C_M - C_{M,D})\big\|$. From Proposition 6, we know that $L_{M,D} = Q_{M,D}$, so we have
$$\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\| \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D}\big(Q_{M,D} + Q_{M,D_j}\big)\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|,$$
which proves the first result of this proposition. In the following, we will prove the second result. Note that $S_M\big(\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big) = \bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}$. According to (25), we have
$$\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\,S_M(C_{M,D}+\lambda I)^{-1}\big(C_{M,D} - C_M\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) + \sum_{j=1}^m \frac{|D_j|}{|D|}\,S_M(C_{M,D}+\lambda I)^{-1}\big(C_M - C_{M,D_j}\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big) := \sum_{j=1}^m \frac{|D_j|}{|D|}\big(\aleph_j^1 + \aleph_j^2\big). \tag{26}$$
Note that
$$\aleph_j^1 = S_M(C_M+\lambda I)^{-1/2}\,(C_M+\lambda I)^{1/2}(C_{M,D}+\lambda I)^{-1/2}\,(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\,(C_M+\lambda I)^{-1/2}\big(C_{M,D} - C_M\big)(C_M+\lambda I)^{-1/2}\,(C_M+\lambda I)^{-1/2}(C_M+\lambda I)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big),$$
thus we have
$$\big\|\aleph_j^1\big\|_\rho \le J^2_{M,D}\,Q_{M,D}\,\big\|S_M(C_M+\lambda I)^{-1/2}\big\|\,\big\|(C_M+\lambda I)^{-1/2}(C_M+\lambda I)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)\big\| \le J^2_{M,D}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}(C_M+\lambda I)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)\big\|,$$
since $\big\|S_M(C_M+\lambda I)^{-1/2}\big\| = \big\|(C_M+\lambda I)^{-1/2}C_M(C_M+\lambda I)^{-1/2}\big\|^{1/2} \le 1$. So, we have
$$\big\|\aleph_j^1\big\|_\rho \le J^2_{M,D}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}\big(S_M^* S_M + \lambda I\big)\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)\big\| \le J^2_{M,D}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}S_M^*\,S_M\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)\big\| + \lambda\,J^2_{M,D}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}\big(\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big)\big\|$$
$$\le J^2_{M,D}\,Q_{M,D}\,\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho + \sqrt\lambda\,J^2_{M,D}\,Q_{M,D}\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\| \le J^2_{M,D}\,Q_{M,D}\Big(\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|\Big),$$
where the last inequality uses the fact that $\big\|(C_M+\lambda I)^{-1/2}S_M^*\big\| = \big\|(C_M+\lambda I)^{-1/2}S_M^* S_M(C_M+\lambda I)^{-1/2}\big\|^{1/2} \le 1$. Similar to the above process, we can also obtain
$$\big\|\aleph_j^2\big\|_\rho \le J^2_{M,D}\,Q_{M,D_j}\Big(\big\|\hat f_{M,D_j,\lambda} - f_{M,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\hat w_{M,D_j,\lambda} - w_{M,\lambda}\big\|\Big).$$
Thus, using (26), we can prove this result.

D.2 APPENDIX: PROOF OF THEOREM 5

Proof of Theorem 5. Combining Propositions 9 and 4, we have
$$\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D}\big(Q_{M,D} + Q_{M,D_j}\big)\Big[\big(J_{M,D_j} + J^2_{M,D_j}\big)\big(R_{M,D_j} + K_{M,D_j}\big) + \big(2J_{M,D_j} + J^2_{M,D_j} + 1\big)\big\|f_{M,\lambda} - f_\rho\big\|_\rho\Big].$$
From Propositions 7 and 8, one can see that if $|D_j| \ge \Omega\big(\mathcal{N}_\infty(\lambda)\big)$ and $\lambda \le \|L_K\|$, then for any $\delta > 0$, with probability at least $1-\delta$,
$$\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho = O\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\big(Q_{M,D} + Q_{M,D_j}\big)\Big(\Upsilon_{M,D_j,\lambda}\log\frac1\delta + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho\Big)\Big),$$
where $\Upsilon_{M,D_j,\lambda} = \frac{1}{\sqrt\lambda\,|D_j|} + \sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|}}$. Note that $Q_{M,D} \le Q_{M,D_j}$; by Lemmas 2 and 3, we have
$$\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho = O\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,Q_{M,D_j}\Upsilon_{M,D_j,\lambda}\log\frac1\delta + \lambda^r Q_{M,D_j}\Big). \tag{30}$$
Note that
$$\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho = \big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda} + \hat f_{M,D,\lambda} - f_{M,\lambda} + f_{M,\lambda} - f_\lambda + f_\lambda - f_\rho\big\|_\rho \le \big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \big\|\hat f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho. \tag{31}$$
Combining Eq. (30), Lemmas 2-3, Proposition 8 and Eq. (31), one can see that if $M \ge \Omega\big(\frac{\mathcal{N}^{2r-1}(\lambda)}{\lambda^{2r-1}}\,\mathcal{N}_\infty(\lambda)^{2-2r} \vee \mathcal{N}_\infty(\lambda)\big)$, then with probability $1-\delta$, we have
$$\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho = O\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,Q_{M,D_j}\Upsilon_{M,D_j,\lambda}\log\frac1\delta + \Upsilon_{M,D,\lambda}\log\frac1\delta + \lambda^r Q_{M,D_j} + \lambda^r\Big).$$
If $|D_1| = \cdots = |D_m|$ and $\lambda = \Omega\big(|D|^{-\frac{1}{2r+\gamma}}\big)$, we have
$$\Upsilon_{M,D,\lambda} = O\big(|D|^{-\frac{r}{2r+\gamma}} + |D|^{-\frac{4r+2\gamma-1}{4r+2\gamma}}\big) = O\big(|D|^{-\frac{r}{2r+\gamma}}\big),$$
$$\Upsilon_{M,D_j,\lambda} = O\big(\sqrt m\,|D|^{-\frac{r}{2r+\gamma}} + m\,|D|^{-\frac{4r+2\gamma-1}{4r+2\gamma}}\big), \qquad Q_{M,D_j} = O\big(m\,|D|^{-\frac{2r+\gamma-1}{2r+\gamma}} + \sqrt m\,|D|^{-\frac{2r+\gamma-1}{4r+2\gamma}}\big).$$
Thus, when $m \le \Omega\big(|D|^{\frac{2r+\gamma-1}{4r+2\gamma}}\big)$, we have
$$\Upsilon_{M,D_j,\lambda}\,Q_{M,D_j} = O\big(|D|^{-\frac{r}{2r+\gamma}}\big), \qquad Q_{M,D_j}\,\lambda^r = O\big(|D|^{-\frac{r}{2r+\gamma}}\,|D|^{-\frac{2r+\gamma-1}{8r+4\gamma}}\big) = O\big(|D|^{-\frac{r}{2r+\gamma}}\big),$$
$$|D_j| = \frac{|D|}{m} \ge \Omega\big(|D|^{\frac{2r+\gamma+1}{4r+2\gamma}}\big) \ge \Omega\big(|D|^{\frac{1}{2r+\gamma}}\big) = \Omega\big(\mathcal{N}_\infty(\lambda)\big).$$
Thus, we have $\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = O\big(|D|^{-\frac{2r}{2r+\gamma}}\log^2\frac1\delta\big)$, which proves the result.

E APPENDIX: PROOF OF THEOREM 6

E.1 APPENDIX: ERROR DECOMPOSITION FOR DKRR-RF-CM IN PROBABILITY

Proposition 10.
$$\big\|\bar f^{\,t}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho \le \Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D_j}\big(Q_{M,D} + Q_{M,D_j}\big)\Big)^t\Big(\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|\Big),$$
where $J_{M,D} := \big\|(C_{M,D}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\big\|$ and $Q_{M,D} := \big\|(C_M+\lambda I)^{-1/2}(C_M - C_{M,D})(C_M+\lambda I)^{-1/2}\big\|$.

Proof.
Note that
$$\hat w_{M,D,\lambda} = \bar w^{\,t-1}_{M,D,\lambda} - (C_{M,D}+\lambda I)^{-1}\big[(C_{M,D}+\lambda I)\,\bar w^{\,t-1}_{M,D,\lambda} - \Phi_{M,D}\,\bar y_D\big],$$
$$\bar w^{\,t}_{M,D,\lambda} = \bar w^{\,t-1}_{M,D,\lambda} - \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big[(C_{M,D}+\lambda I)\,\bar w^{\,t-1}_{M,D,\lambda} - \Phi_{M,D}\,\bar y_D\big].$$
Thus, we have
$$\hat w_{M,D,\lambda} - \bar w^{\,t}_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\big[(C_{M,D_j}+\lambda I)^{-1} - (C_{M,D}+\lambda I)^{-1}\big]\big[(C_{M,D}+\lambda I)\,\bar w^{\,t-1}_{M,D,\lambda} - \Phi_{M,D}\,\bar y_D\big] = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big(C_{M,D} - C_{M,D_j}\big)\big[\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big].$$
Splitting $C_{M,D} - C_{M,D_j} = (C_{M,D} - C_M) + (C_M - C_{M,D_j})$, arguing exactly as in the proof of Proposition 9 (bounding $\|S_M\aleph_j^1\|_\rho$ and $\|S_M\aleph_j^2\|_\rho$ for the corresponding terms), and iterating over $t$, we obtain Proposition 10. Combining Proposition 10 with Proposition 9, and noting that $Q_{M,D} \le Q_{M,D_j}$, we can obtain
$$\big\|\bar f^{\,t}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho = O\Bigg(\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,Q_{M,D_j}\Big)^t\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,Q_{M,D_j}\,\Upsilon_{M,D_j,\lambda} + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho\Big)\Bigg).$$
Note that
$$\big\|\bar f^{\,t}_{M,D,\lambda} - f_\rho\big\|_\rho \le \big\|\bar f^{\,t}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \big\|\hat f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho.$$
Thus, by Lemmas 2 and 3, one can see that if $M \ge \Omega\big(\frac{\mathcal{N}^{2r-1}(\lambda)}{\lambda^{2r-1}}\,\mathcal{N}_\infty(\lambda)^{2-2r} \vee \mathcal{N}_\infty(\lambda)\big)$, then, proceeding as in the proof of Theorem 5 with $m \le \Omega\big(|D|^{\frac{(2r+\gamma-1)(p+1)}{(2r+\gamma)(p+2)}}\big)$, with probability at least $1-\delta$ we have $\big\|\bar f^{\,p}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = O\big(|D|^{-\frac{2r}{2r+\gamma}}\log^{p+2}\frac1\delta\big)$, which proves Theorem 6.

F APPENDIX: PROOF OF THEOREMS 1, 2, 3

Proof. From (Smale & Zhou, 2007; Caponnetto & Vito, 2007), if $r = 1/2$, then $f_\rho \in \mathcal{H}_K$. Thus, in this case, $f_{\mathcal{H}_K}$ exists and $\mathcal{E}(f_{\mathcal{H}_K}) = \mathcal{E}(f_\rho)$. Note that Assumption 1 is always satisfied for $\gamma = 1$ and $c = \tau^2$. So, using Theorems 4, 5 and 6 with $r = 1/2$, $\gamma = 1$ and $c = \tau^2$, Theorems 1, 2 and 3 can be proved.



(Sutherland & Schneider, 2015; Sriperumbudur & Szabó, 2015; Bach, 2017; Avron et al., 2017b). By applying the standard integral operator framework (Smale & Zhou, 2007; Caponnetto & Vito, 2007), the optimal generalization bounds of KRR with random features were established in (Rudi & Rosasco, 2017), requiring only $O(\sqrt{|D|}\log|D|)$ random features. To decrease the number of random features, an improved approach based on a novel leverage-score sampling strategy was proposed in (Rudi et al., 2018). Sun et al. (2018) extended the result of (Rudi & Rosasco, 2017) to SVM. Li et al. (2019e) further devised a simple framework for the unified analysis of random Fourier features, which applies to KRR as well as to SVM and logistic regression. To further improve the effectiveness, Li et al. (2019b) recently considered the simple combination of divide-and-conquer and random features. However, to guarantee the optimal generalization performance, the number of local machines has to be restricted to a constant, degenerating the method into a single random features-based large scale KRR.

Because the optimal $\lambda$ is selected via 5-fold cross validation, the computational complexity is enlarged accordingly. However, it should be noted that even for the plain methods, tuning the optimal $\lambda$ is also required, which enlarges the computational complexity as well.

http://www.csie.ntu.edu.tw/∼cjlin/libsvm



and $\bar y_D = \frac{1}{\sqrt{|D|}}(y_1,\ldots,y_{|D|})^T$. KRR-RF requires $O(M|D|)$ memory to store $\Phi_{M,D}$, and $O(M^3)$ and $O(M^2|D|)$ time to compute the inverse of $(\Phi_{M,D}\Phi_{M,D}^T + \lambda I)$ and the matrix product $\Phi_{M,D}\Phi_{M,D}^T$, respectively. Thus, the total space and time complexities of KRR-RF are $O(M|D|)$ and $O(M^2|D|)$ (with $M \le |D|$), respectively.
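As a runnable illustration of these costs, here is a minimal sketch of KRR-RF (an assumption-laden toy, not the paper's code: the Gaussian-kernel random Fourier feature map, the synthetic 1-D data, and all hyperparameters are illustrative). It stores only the $|D| \times M$ feature matrix, forms the $M \times M$ Gram matrix in $O(M^2|D|)$ time, and solves the linear system in $O(M^3)$ time:

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier feature map phi_M(x) = sqrt(2/M) * cos(W x + b) (Rahimi & Recht, 2007)."""
    M = W.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)   # |D| x M, i.e. Phi_{M,D}^T

def krr_rf_fit(X, y, M=200, lam=1e-4, sigma=1.0, seed=0):
    """KRR-RF: solve the M x M system (Phi^T Phi/|D| + lam I) w = Phi^T y/|D|."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(M, X.shape[1]))  # spectral samples of a Gaussian kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    Phi = rff_features(X, W, b)                   # O(M |D|) memory
    n = X.shape[0]
    A = Phi.T @ Phi / n + lam * np.eye(M)         # O(M^2 |D|) time
    w = np.linalg.solve(A, Phi.T @ y / n)         # O(M^3) time
    return w, W, b

def krr_rf_predict(Xt, w, W, b):
    return rff_features(Xt, W, b) @ w

# toy usage on a 1-D regression problem
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=500)
w, W, b = krr_rf_fit(X, y)
pred = krr_rf_predict(X, w, W, b)
```

The feature matrix here is stored as $|D| \times M$ (the transpose of $\Phi_{M,D}$ above), a common NumPy convention.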

$\hat w_{M,D_j,\lambda} = (C_{M,D_j}+\lambda I)^{-1}\Phi_{M,D_j}\bar y_{D_j}$. The space complexity, time complexity and communication complexity of DKRR-RF for each local machine are $O(M|D_j|)$, $O(M^3 + M^2|D_j|)$ and $O(M)$, respectively.
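The divide-and-conquer step itself is a few lines. The sketch below (NumPy, with an $|D| \times M$ feature-matrix convention and synthetic placeholder data, all of which are illustrative assumptions) solves the $M$-dimensional local problem on each shard and averages the solutions with weights $|D_j|/|D|$:

```python
import numpy as np

def local_solve(Phi_j, y_j, lam):
    """One local machine: w_j = (Phi_j^T Phi_j/|D_j| + lam I)^{-1} Phi_j^T y_j/|D_j|."""
    n_j, M = Phi_j.shape
    A = Phi_j.T @ Phi_j / n_j + lam * np.eye(M)     # O(M^2 |D_j|) time, O(M |D_j|) memory
    return np.linalg.solve(A, Phi_j.T @ y_j / n_j)  # O(M^3) time

def dkrr_rf(Phi, y, m, lam):
    """Divide-and-conquer: solve on m shards, average with weights |D_j|/|D|."""
    n = Phi.shape[0]
    w_bar = np.zeros(Phi.shape[1])
    for idx in np.array_split(np.arange(n), m):
        w_bar += (len(idx) / n) * local_solve(Phi[idx], y[idx], lam)
    return w_bar

# Sanity check: with m = 1 the estimator coincides with centralized KRR-RF.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(400, 50))   # stand-in feature matrix (illustrative)
y = rng.normal(size=400)
w1 = dkrr_rf(Phi, y, m=1, lam=1e-2)
w4 = dkrr_rf(Phi, y, m=4, lam=1e-2)
```

Each local solve touches only its own shard, which is what yields the per-machine complexities stated above.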

Letting $\bar w^{\,0}_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\hat w_{M,D_j,\lambda}$, it is easy to verify that $\bar w^{\,0}_{M,D,\lambda}$ coincides with the DKRR-RF solution. Noting that the global gradient $g_{M,D,\lambda}(w)$ can be achieved via the communications of the local gradients $g_{M,D_j,\lambda}(w)$, i.e., $g_{M,D,\lambda}(w) = \sum_{j=1}^m \frac{|D_j|}{|D|}\,g_{M,D_j,\lambda}(w)$, we consider the following Newton-Raphson iteration-based communication strategy:

Algorithm: Distributed KRR with Random Features and Communications (DKRR-RF-CM)
Initialize: $\bar w^{\,0}_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\hat w_{M,D_j,\lambda}$
for $t = 1$ to $p$ do
  Local machines: compute the local gradients $g_{M,D_j,\lambda}\big(\bar w^{\,t-1}_{M,D,\lambda}\big)$ and communicate them back to the global machine.
  Global machine: aggregate the global gradient $g_{M,D,\lambda}\big(\bar w^{\,t-1}_{M,D,\lambda}\big) = \sum_{j=1}^m \frac{|D_j|}{|D|}\,g_{M,D_j,\lambda}\big(\bar w^{\,t-1}_{M,D,\lambda}\big)$ and communicate it to each local machine.
  Local machines: compute $\beta_j^{t-1} = \big(\Phi_{M,D_j}\Phi_{M,D_j}^T + \lambda I\big)^{-1} g_{M,D,\lambda}\big(\bar w^{\,t-1}_{M,D,\lambda}\big)$ and communicate it back to the global machine.
  Global machine: update $\bar w^{\,t}_{M,D,\lambda} = \bar w^{\,t-1}_{M,D,\lambda} - \sum_{j=1}^m \frac{|D_j|}{|D|}\,\beta_j^{t-1}$.
end for
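The communication loop can be simulated in a single process as a sketch (an illustration under assumptions: the $|D| \times M$ feature convention, the shard layout, and the synthetic data are all placeholders, not the paper's implementation). Each round forms the global gradient as the weighted sum of local gradients and then applies the locally preconditioned Newton-Raphson correction:

```python
import numpy as np

def dkrr_rf_cm(Phi, y, m, lam, p):
    """Sketch of DKRR-RF-CM, simulated in one process.

    Convention (an assumption): Phi is |D| x M, so the local covariance is
    C_j = Phi_j^T Phi_j / |D_j| and the local gradient at w is
    g_j(w) = (C_j + lam I) w - Phi_j^T y_j / |D_j|.
    """
    n, M = Phi.shape
    shards = np.array_split(np.arange(n), m)
    wts = [len(idx) / n for idx in shards]
    C = [Phi[idx].T @ Phi[idx] / len(idx) for idx in shards]
    b = [Phi[idx].T @ y[idx] / len(idx) for idx in shards]
    I = np.eye(M)
    # DKRR-RF initializer: weighted average of the local solutions
    w = sum(a * np.linalg.solve(Cj + lam * I, bj) for a, Cj, bj in zip(wts, C, b))
    for _ in range(p):
        # communication 1: global gradient = weighted sum of local gradients
        g = sum(a * ((Cj + lam * I) @ w - bj) for a, Cj, bj in zip(wts, C, b))
        # communication 2: each machine applies its local preconditioner; global machine averages
        w = w - sum(a * np.linalg.solve(Cj + lam * I, g) for a, Cj in zip(wts, C))
    return w

# The centralized KRR-RF solution is the fixed point of the iteration.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(2000, 20))
y = rng.normal(size=2000)
w_central = np.linalg.solve(Phi.T @ Phi / 2000 + 1e-2 * np.eye(20), Phi.T @ y / 2000)
w_cm = dkrr_rf_cm(Phi, y, m=4, lam=1e-2, p=6)
```

In this sketch the gap to the centralized solution shrinks geometrically with each round, which mirrors why Theorem 6 admits more partitions as $p$ grows.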

OPTIMAL LEARNING RATE FOR DKRR-RF IN EXPECTATION

Theorem 1. Suppose that $|\psi(x,\omega)| \le \tau$ almost surely, $\tau \in [1,\infty)$, and $|y| \le \zeta$. If $\lambda = \Omega\big(|D|^{-\frac12}\big)$, $|D_1| = \cdots = |D_m|$, and the number of partitions $m$ and the number of random features $M$ respectively satisfy $m \lesssim |D|^{\frac12}$ and $M \gtrsim |D|^{\frac12}$, then, with probability at least $1-\delta$, $\mathbb{E}\,\big\|\bar f^{\,0}_{M,D,\lambda} - f_\rho\big\|_\rho^2 = O\big(|D|^{-\frac12}\log^2\frac1\delta\big)$; this is Theorem 4 specialized to the basic setting $r = 1/2$, $\gamma = 1$.

Figure 1: The mean square error or error rate on the test set with different numbers of partitions for KRR, DKRR-RF and our DKRR-RF-CM. # represents the number of communications.

Figure 2: The mean square error or error rate on the test set with different numbers of partitions for DKRR-RF and our DKRR-RF-CM on mnist, a8a, a6a, space-ga, cpusmall and abalone. # represents the number of communications.


$$\hat w_{M,D,\lambda} - \bar w^{\,t}_{M,D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\big[(C_{M,D_j}+\lambda I)^{-1} - (C_{M,D}+\lambda I)^{-1}\big]\big[(C_{M,D}+\lambda I)\,\bar w^{\,t-1}_{M,D,\lambda} - \Phi_{M,D}\,\bar y_D\big] = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big(C_{M,D} - C_{M,D_j}\big)(C_{M,D}+\lambda I)^{-1}\big[(C_{M,D}+\lambda I)\,\bar w^{\,t-1}_{M,D,\lambda} - \Phi_{M,D}\,\bar y_D\big]$$
$$= \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big(C_{M,D} - C_{M,D_j}\big)\big[\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big] = \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big[C_{M,D} - C_M\big]\big[\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big] + \sum_{j=1}^m \frac{|D_j|}{|D|}\,(C_{M,D_j}+\lambda I)^{-1}\big[C_M - C_{M,D_j}\big]\big[\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big],$$
and
$$S_M\aleph_j^1 = S_M(C_M+\lambda I)^{-1/2}\,(C_M+\lambda I)^{1/2}(C_{M,D_j}+\lambda I)^{-1/2}\,(C_{M,D_j}+\lambda I)^{-1/2}(C_M+\lambda I)^{1/2}\,(C_M+\lambda I)^{-1/2}\big[C_{M,D} - C_M\big](C_M+\lambda I)^{-1/2}\,(C_M+\lambda I)^{-1/2}(C_M+\lambda I)\big[\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big]. \tag{36}$$

Note that $C_M = S_M^* S_M$, so
$$C_M\big(\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big) = S_M^* S_M\big(\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big) = S_M^*\big(\bar f^{\,t-1}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big).$$
Substituting the above equality into Eq. (36), we have
$$\big\|S_M\aleph_j^1\big\|_\rho \le J^2_{M,D_j}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}S_M^*\big(\bar f^{\,t-1}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big)\big\| + \lambda\,J^2_{M,D_j}\,Q_{M,D}\,\big\|(C_M+\lambda I)^{-1/2}\big(\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big)\big\| \le J^2_{M,D_j}\,Q_{M,D}\Big(\big\|\bar f^{\,t-1}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|\Big),$$
where the last inequality uses the fact that $\big\|(C_M+\lambda I)^{-1/2}S_M^*\big\| = \big\|(C_M+\lambda I)^{-1/2}C_M(C_M+\lambda I)^{-1/2}\big\|^{1/2} \le 1$. Using the same process, we can obtain
$$\big\|S_M\aleph_j^2\big\|_\rho \le J^2_{M,D_j}\,Q_{M,D_j}\Big(\big\|\bar f^{\,t-1}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|\Big).$$
Thus, we have
$$\big\|\hat f_{M,D,\lambda} - \bar f^{\,t}_{M,D,\lambda}\big\|_\rho = \big\|S_M\big(\hat w_{M,D,\lambda} - \bar w^{\,t}_{M,D,\lambda}\big)\big\|_\rho \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D_j}\big(Q_{M,D} + Q_{M,D_j}\big)\Big(\big\|\bar f^{\,t-1}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|\Big),$$
and, with $L_{M,D} := \big\|(C_M+\lambda I)^{-1}(C_M - C_{M,D})\big\| = Q_{M,D}$ from Proposition 6,
$$\big\|\hat w_{M,D,\lambda} - \bar w^{\,t}_{M,D,\lambda}\big\| \le \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D_j}\big(L_{M,D} + L_{M,D_j}\big)\,\big\|\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\| = \sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D_j}\big(Q_{M,D} + Q_{M,D_j}\big)\,\big\|\bar w^{\,t-1}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|.$$
Iterating the two recursions over $t$ yields
$$\big\|\bar f^{\,t}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho \le \Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\,J^2_{M,D_j}\big(Q_{M,D} + Q_{M,D_j}\big)\Big)^t\Big(\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\|\Big),$$
which proves the result.

E.2 APPENDIX: PROOF OF THEOREM 6

Proof of Theorem 6. Substituting Propositions 4, 5, 7, 8 and Lemma 4 into Proposition 9, we have
$$\big\|\bar f^{\,0}_{M,D,\lambda} - \hat f_{M,D,\lambda}\big\|_\rho + \sqrt\lambda\,\big\|\bar w^{\,0}_{M,D,\lambda} - \hat w_{M,D,\lambda}\big\| = O\Big(\sum_{j=1}^m \frac{|D_j|}{|D|}\big(Q_{M,D} + Q_{M,D_j}\big)\Big(\Upsilon_{M,D_j,\lambda}\log\frac1\delta + \big\|f_{M,\lambda} - f_\lambda\big\|_\rho + \big\|f_\lambda - f_\rho\big\|_\rho\Big)\Big).$$

By setting $|D_1| = \cdots = |D_m|$ and $\lambda = \Omega\big(|D|^{-\frac{1}{2r+\gamma}}\big)$, we know that $\Upsilon_{M,D_j,\lambda} = O\big(\sqrt m\,|D|^{-\frac{r}{2r+\gamma}} + m\,|D|^{-\frac{4r+2\gamma-1}{4r+2\gamma}}\big)$.

Computational complexity required by different algorithms for the optimal learning rate $O\big(1/\sqrt{|D|}\big)$ in the basic setting. Logarithmic terms are not shown.


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 62076234) and the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098).


Published as a conference paper at ICLR 2021

Thus, substituting the above two inequalities into Eq. (20), we obtain the first result of Proposition 3. In the following, we prove the second result; it follows from Eq. (19) in the same manner.

Proposition 4. The following holds:
$$\big\|\hat f_{M,D,\lambda} - f_{M,\lambda}\big\|_\rho \le J^2_{M,D}\big(R_{M,D} + K_{M,D}\big) + \big(J_{M,D} + J^2_{M,D}\big)\big\|f_{M,\lambda} - f_\rho\big\|_\rho.$$
Proof. Combining Propositions 2 and 3, we can prove this result.

