DISTRIBUTED LEAST SQUARE RANKING WITH RANDOM FEATURES

Abstract

In this paper, we study the statistical properties of pairwise ranking using distributed learning and random features (called DRank-RF) and establish its convergence analysis in probability. Theoretical analysis shows that DRank-RF remarkably reduces the computational requirements while preserving a satisfactory convergence rate. An extensive experiment verifies the effectiveness of DRank-RF. Furthermore, to improve the learning performance of DRank-RF, we propose an effective communication strategy for it and demonstrate the power of communications via theoretical assessments and numerical experiments.

1. INTRODUCTION

Distributed learning has attracted much attention in the literature and has been widely used for kernel learning in large-scale scenarios (Zhang et al., 2013; Chang et al., 2017; Lin et al., 2020b). Distributed kernel learning has three main ingredients: processing the data subsets on the local kernel machines to produce local estimators; communicating selected information, such as the data (Bellet et al., 2015), gradients (Zeng & Yin, 2018), or local estimators (Huang & Huo, 2019), between the local processors and the global processor; and synthesizing the local estimators and the communicated information on the global processor to produce a global estimator. Note that in divide-and-conquer learning, the second ingredient, communication, is not needed. In terms of both practical challenges and theoretical analysis, distributed learning has made significant breakthroughs in multi-penalty regularization (Guo et al., 2019), coefficient-based regularization (Pang & Sun, 2018), spectral algorithms (Mücke & Blanchard, 2018; Lin et al., 2020a), kernel ridge regression (Yin et al., 2020; 2021), and semi-supervised regression (Li et al., 2022). All of the above are restricted to pointwise kernel learning; distributed learning for pairwise kernel learning still has a long way to go. The existing distributed pairwise learning methods (Chen et al., 2019; 2021) have high computational requirements, which motivates us to explore theoretical foundations and efficient methods for pairwise ranking kernel methods under distributed learning. Random features methods (Rahimi & Recht, 2007; Carratino et al., 2018; Liu et al., 2021) have a long and distinguished history; they embed the non-linear feature space (i.e., the reproducing kernel Hilbert space associated with the kernel) into a low-dimensional Euclidean space while incurring an arbitrarily small additive distortion in the inner product values.
This enables one to overcome the high computational requirements of kernel learning, since one can now work in an explicit low-dimensional space whose complexity depends only on the dimensionality of that space. Random features have made rapid progress in reducing the complexity of kernel ridge regression (Liu et al., 2021) and semi-supervised regression (Li et al., 2022). However, both complexity reduction and learning theory analysis remain unclear for distributed pairwise ranking kernel learning. In this paper, to reduce the computational requirements of pairwise ranking kernel learning, we combine distributed learning and random features into distributed least square ranking with random features (DRank-RF) to deal with large-scale applications, and study its statistical properties in probability through the integral operator framework. To further improve the performance of DRank-RF, we consider communications among the different local processors. The main contributions of this paper are as follows: 1) We construct a novel method, DRank-RF, which improves on the existing state-of-the-art distributed pairwise ranking kernel learning. This work is the first to apply random features to least square ranking and derive theoretical guarantees, a new exploration of random features in least square ranking. Theoretically, we derive the convergence rate of the proposed method, which is sharper than that of the existing state-of-the-art distributed pairwise ranking kernel learning (see Theorem 1). Computationally, DRank-RF requires essentially $O(m^2|D_j|)$ time and $O(m|D_j|)$ memory, where $m$ is the number of random features, $m < |D_j|$, and $|D_j|$ is the number of data points in each local processor. The proposed method greatly reduces the computational requirements compared with the state-of-the-art works (see Table 1).
Experimental results verify that the proposed method achieves testing error similar to the exact and state-of-the-art approximate kernel least square ranking while holding a great advantage in training time, which is consistent with our theoretical analysis. 2) We propose a communication strategy, called DRank-RF-C, to further improve the performance of DRank-RF. Statistical analysis shows that, with the help of the communication strategy, DRank-RF-C obtains a faster convergence rate than DRank-RF, and the numerical results validate the power of the proposed communication strategy. The paper is organized as follows: In section 2, we briefly introduce the least square ranking problem and distributed least square ranking. In section 3, we introduce the proposed methods. Section 4 presents the theoretical analysis of the proposed DRank-RF and DRank-RF-C. In section 5, we compare the proposed methods with related works. The remaining sections present the experiments and conclusions.

2. BACKGROUND

There is a compact metric space $Z := \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{q+1}$, where $\mathcal{X} \subset \mathbb{R}^q$ and $\mathcal{Y} \subset [-b, b]$ for some positive constant $b$. The sample set $D := \{(x_i, y_i)\}_{i=1}^N$ of size $N = |D|$ is drawn independently from an intrinsic Borel probability measure $\rho$ on $Z$; $\rho(y|X = x)$ denotes the conditional distribution for a given input $x$. The hypothesis space is the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (Aronszajn, 1950). We denote the inner product in $\mathcal{H}_K$ by $\langle \cdot, \cdot \rangle$ and the corresponding norm by $\|\cdot\|_K$.

2.1. LEAST SQUARE RANKING (LSRANK)

Least square ranking (LSRank) is one of the most popular learning methods in the machine learning community (Chen, 2012; Zhao et al., 2017; Chen et al., 2019), which can be stated as
$$f_{D,\lambda} = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}_D(f) + \lambda \|f\|_K^2, \qquad \mathcal{E}_D(f) = \frac{1}{|D|^2} \sum_{i,k=1}^{|D|} \big(y_i - y_k - (f(x_i) - f(x_k))\big)^2,$$
where the regularization parameter $\lambda > 0$. The main purpose of LSRank is to find a function $f : \mathcal{X} \to \mathbb{R}$ from empirical observations such that the ranking risk
$$\mathcal{E}(f) = \int_Z \int_Z \big(y - y' - (f(x) - f(x'))\big)^2 \, d\rho(x, y)\, d\rho(x', y') \tag{1}$$
is as small as possible, where $x, x' \in \mathcal{X}$. The optimal predictor (Chen, 2012; Chen et al., 2013; Kriukova et al., 2016) under Eq. (1) is the regression function $f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|X = x)$, $x \in \mathcal{X}$.

Complexity Analysis LSRank requires $O(|D|^3)$ time and $O(|D|^2)$ space, which is prohibitive for large-scale settings.
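For intuition, the double sum in $\mathcal{E}_D(f)$ need not be evaluated pair by pair: with the centering matrix $W = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ it collapses to a quadratic form, which also yields a closed-form minimizer for a linear model. A minimal numpy sketch under these assumptions (the linear model and all variable names are ours, standing in for the kernel expansion):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, lam = 50, 3, 0.1
X = rng.normal(size=(n, q))
y = X @ rng.normal(size=q) + 0.1 * rng.normal(size=n)

def pairwise_risk(f, y):
    """Empirical ranking risk E_D(f), computed by brute force over all pairs."""
    d = (y[:, None] - y[None, :]) - (f[:, None] - f[None, :])
    return (d ** 2).sum() / len(y) ** 2

# Centering matrix W = I - (1/n) 1 1^T collapses the double sum:
# sum_{i,k} ((y_i - y_k) - (f_i - f_k))^2 = 2 n (y - f)^T W (y - f).
W = np.eye(n) - np.ones((n, n)) / n

# For a linear model f(x) = w^T x, minimizing (2/n)(y - Xw)^T W (y - Xw) + lam ||w||^2
# gives this closed form; the kernel/random-feature versions are analogous.
w_hat = np.linalg.solve(X.T @ W @ X + n * lam / 2 * np.eye(q), X.T @ W @ y)
```

The stationarity condition of the regularized objective is exactly the linear system solved above.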

2.2. DISTRIBUTED LEAST SQUARE RANKING (DRANK)

Let the dataset $D = \cup_{j=1}^p D_j$, where each subset $D_j := \{(x_i^j, y_i^j)\}_{i=1}^{|D_j|}$ is stored on the $j$-th local processor for $1 \le j \le p$. DRank is defined by
$$f^0_{D,\lambda} = \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2}\, f_{D_j,\lambda},$$
where the local LSRank estimator is $f_{D_j,\lambda} = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}_{D_j}(f) + \lambda \|f\|_K^2$ with
$$\mathcal{E}_{D_j}(f) = \frac{1}{|D_j|^2} \sum_{i,k=1}^{|D_j|} \big(y_i^j - y_k^j - (f(x_i^j) - f(x_k^j))\big)^2.$$

Complexity Analysis The time complexity, space complexity, and communication complexity of DRank for each local processor are $O(|D_j|^3)$, $O(|D_j|^2)$, and $O(|D_j|)$ respectively, where $j = 1, \ldots, p$ and $p$ is the number of partitions.
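The synthesis step of DRank can be sketched directly: each processor solves its local problem, and the global estimator is the $|D_j|^2/\sum_k |D_k|^2$-weighted average. A toy sketch with linear local estimators (our own simplification of the kernel case; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, q, p, lam = 600, 4, 3, 0.01
X = rng.normal(size=(N, q))
y = X @ rng.normal(size=q) + 0.1 * rng.normal(size=N)

def local_lsrank(Xj, yj, lam):
    """Closed-form linear LSRank on one partition (centering-matrix form)."""
    n = len(yj)
    W = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.solve(Xj.T @ W @ Xj + n * lam / 2 * np.eye(Xj.shape[1]),
                           Xj.T @ W @ yj)

parts = np.array_split(np.arange(N), p)          # D = D_1 ∪ ... ∪ D_p
sizes = np.array([len(idx) for idx in parts], dtype=float)
weights = sizes ** 2 / (sizes ** 2).sum()        # |D_j|^2 / sum_k |D_k|^2

# Global DRank estimator: weighted average of the local estimators.
w_global = sum(wt * local_lsrank(X[idx], y[idx], lam)
               for wt, idx in zip(weights, parts))
```

On well-conditioned data the averaged estimator stays close to the single-machine solution, which is the behavior the divide-and-conquer analysis quantifies.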

3. THE PROPOSED METHODS

3.1. DISTRIBUTED LEAST SQUARE RANKING WITH RANDOM FEATURES (DRANK-RF)

Here we first introduce the main properties of shift-invariant kernels and the basic idea of random features. A shift-invariant kernel can be written as
$$K(x, x') = \int_\Omega \psi(x, \omega)\,\psi(x', \omega)\, \pi(\omega)\, d\omega$$
if the spectral measure has a density function $\pi(\cdot)$ (Li et al., 2019; Carratino et al., 2018), where $\psi : \mathcal{X} \times \Omega \to \mathbb{R}$ is bounded and continuous with respect to $\omega$ and $x$. The basic idea of random features is to approximate the kernel function $K(x, x')$ by its Monte-Carlo estimate (Li et al., 2019; Rahimi & Recht, 2007):
$$K_m(x, x') = \frac{1}{m}\sum_{i=1}^m \psi(x, \omega_i)\,\psi(x', \omega_i) = \langle \phi_m(x), \phi_m(x')\rangle, \qquad \phi_m(x) = \frac{1}{\sqrt{m}}\big(\psi(x, \omega_1), \ldots, \psi(x, \omega_m)\big)^T.$$
Back in supervised learning (Chen, 2012), combining random features with least squares ranking yields
$$f_{m,D,\lambda}(x) = g_{m,D,\lambda}^T \phi_m(x), \qquad g_{m,D,\lambda} = \Big(\Phi_{m,D} W_D \Phi_{m,D}^T + \frac{\lambda}{2} I\Big)^{-1} \Phi_{m,D} W_D \bar{y}_D, \tag{3}$$
where $\Phi_{m,D} = \frac{1}{\sqrt{|D|}}(\phi_m(x_1), \ldots, \phi_m(x_{|D|}))$, $W_D = I_{|D|} - \frac{1}{|D|}\mathbf{1}_{|D|}\mathbf{1}_{|D|}^T = \frac{1}{|D|}(|D| I - \mathbf{1}_{|D|}\mathbf{1}_{|D|}^T)$, $I_{|D|}$ is the identity matrix, $\mathbf{1}_{|D|} = (1, \ldots, 1)^T \in \mathbb{R}^{|D|}$, and $\bar{y}_D = \frac{1}{\sqrt{|D|}}(y_1, \ldots, y_{|D|})^T$. DRank with random features (DRank-RF) is defined as
$$f^0_{m,D,\lambda} = \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2}\, f_{m,D_j,\lambda}, \tag{4}$$
where $f_{m,D_j,\lambda} = g_{m,D_j,\lambda}^T \phi_m(x)$ with $g_{m,D_j,\lambda} = (\Phi_{m,D_j} W_{D_j} \Phi_{m,D_j}^T + \frac{\lambda}{2} I)^{-1} \Phi_{m,D_j} W_{D_j} \bar{y}_{D_j}$. Random features have a long history and have been studied in different learning settings, for example kernel ridge regression (Liu et al., 2021), kernel classification (Liu et al., 2022), and kernel $k$-means (Chitta et al., 2012). However, random features have not been studied in least square ranking. Our work is the first to apply random features to least square ranking and derive theoretical guarantees, which is a new exploration of the application of random features. In addition, due to the different objective functions and integral operators, the proofs for our proposed method differ from those for existing methods (see Appendix).
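As a concrete instance, the Gaussian kernel is shift-invariant and admits the classical random Fourier features of Rahimi & Recht (2007), $\psi(x,\omega) = \sqrt{2}\cos(w^T x + u)$ with $w$ Gaussian and $u$ uniform. A small sketch of the Monte-Carlo approximation (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
q, m, sigma = 5, 5000, 1.0

# Random Fourier features for K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)):
# psi(x, omega) = sqrt(2) cos(w^T x + u), w ~ N(0, I/sigma^2), u ~ Uniform[0, 2 pi].
Wf = rng.normal(scale=1.0 / sigma, size=(m, q))
u = rng.uniform(0, 2 * np.pi, size=m)

def phi(x):
    """Explicit feature map phi_m(x) with the 1/sqrt(m) normalization built in."""
    return np.sqrt(2.0 / m) * np.cos(Wf @ x + u)

x1, x2 = rng.normal(size=q), rng.normal(size=q)
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))
approx = phi(x1) @ phi(x2)
```

The inner product of the explicit features approximates the kernel value with $O(1/\sqrt{m})$ Monte-Carlo error, which is what lets the ranking estimator work in an $m$-dimensional space.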
Finally, the proposed methods greatly reduce the computational requirements (see Table 1).

3.2. DRANK-RF WITH COMMUNICATIONS (DRANK-RF-C)

The synthesis operation in Eq. (4) takes a weighted average of the estimators produced by the local processors. This weighted averaging, however, cannot improve the approximation ability of DRank-RF on each local processor (Huang & Huo, 2019; Lin et al., 2020b; Yin et al., 2021). To further improve performance, we bring an efficient communication strategy into DRank-RF. In this section, we introduce DRank-RF with communications (DRank-RF-C), which not only improves the approximation ability but also protects the data privacy of each local processor.


For any $g$, according to Eq. (3), one has
$$g_{m,D,\lambda} = g - \Big(\Phi_{m,D} W_D \Phi_{m,D}^T + \frac{\lambda}{2} I\Big)^{-1}\Big[\Big(\Phi_{m,D} W_D \Phi_{m,D}^T + \frac{\lambda}{2} I\Big) g - \Phi_{m,D} W_D \bar{y}_D\Big] = g - \Big(\Phi_{m,D} W_D \Phi_{m,D}^T + \frac{\lambda}{2} I\Big)^{-1} G_{m,D,\lambda}(g), \tag{5}$$
where $G_{m,D,\lambda}(g) = (\Phi_{m,D} W_D \Phi_{m,D}^T + \frac{\lambda}{2} I) g - \Phi_{m,D} W_D \bar{y}_D$. Defining $\bar{g}^0_{m,D,\lambda} = \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2}\, g_{m,D_j,\lambda}$, we obtain
$$\bar{g}^0_{m,D,\lambda} = g - \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2} \Big(\Phi_{m,D_j} W_{D_j} \Phi_{m,D_j}^T + \frac{\lambda}{2} I\Big)^{-1} G_{m,D_j,\lambda}(g). \tag{6}$$
Note that the gradient with respect to $g$ of the empirical risk $\frac{1}{|D_j|^2}\sum \big(y_i - y_k - (g^T\phi_m(x_i) - g^T\phi_m(x_k))\big)^2 + \lambda\|g\|^2$ is $4\, G_{m,D_j,\lambda}(g)$ for all $(x_i, y_i), (x_k, y_k) \in D_j$. Comparing Eq. (5) and Eq. (6), we adopt a communication strategy based on the well-known Newton-Raphson iteration (Lin et al., 2020b; Yin et al., 2021; Chen et al., 2021) for DRank-RF, which takes the form
$$\bar{g}^l_{m,D,\lambda} = \bar{g}^{l-1}_{m,D,\lambda} - \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2}\, \beta_j^{l-1},$$
where $\beta_j^{l-1} = H_{D_j,\lambda}^{-1} \bar{G}_{m,D,\lambda}(\bar{g}^{l-1}_{m,D,\lambda})$, $\bar{G}_{m,D,\lambda}(g) = \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2}\, G_{m,D_j,\lambda}(g)$, $H_{D_j,\lambda} = \Phi_{m,D_j} W_{D_j} \Phi_{m,D_j}^T + \frac{\lambda}{2} I$, and $l$ is the iteration index. The synthesis operation in DRank-RF-C takes a weighted average of the model parameters $\{\beta_j\}$ obtained by the local processors in the last iteration. Algorithm 1 details DRank-RF-C. In step 1, set $\bar{g}^0_{m,D,\lambda} = 0$. In the following steps, the global gradients and model parameters are updated iteratively: for $l = 1, \ldots, M$, distribute $\bar{g}^{l-1}_{m,D,\lambda}$ to each local processor. In step 2 (on each local processor), compute the $p$ local gradient vectors $G_{m,D_j,\lambda}(\bar{g}^{l-1}_{m,D,\lambda})$ and communicate them back to the global processor. In step 3 (on the global processor), from the received $p$ local gradient vectors, compute the global gradient $\bar{G}_{m,D,\lambda}(\bar{g}^{l-1}_{m,D,\lambda})$ and communicate it to each local processor. In step 4 (on each local processor), compute $\beta_j^{l-1}$ and communicate it back to the global processor.
In step 5 (on the global processor), obtain the solution $\bar{g}^l_{m,D,\lambda}$, transmit it to each local processor, and return to step 2.

Complexity Analysis In terms of time complexity, one needs to compute the inverse of $\Phi_{m,D_j} W_{D_j} \Phi_{m,D_j}^T + \frac{\lambda}{2} I$ and the matrix product $\Phi_{m,D_j} W_{D_j} \Phi_{m,D_j}^T$ once per local processor, and to compute the local gradient $G_{m,D_j,\lambda}$ and model parameter $\beta_j$ in each iteration on each local processor. Thus the total time complexity per local processor is $O(m^2|D_j| + mM|D_j|)$, where $M$ is the number of communications. In terms of space complexity, each local processor mainly needs to store $\Phi_{m,D_j}$, so the space complexity per local processor is $O(m|D_j|)$. In terms of communication complexity, the global processor sends the gradient $\bar{G}_{m,D,\lambda}$ and $\bar{g}^l_{m,D,\lambda}$ to each local processor and receives the local gradient $G_{m,D_j,\lambda}$ and model parameter $\beta_j$ from each local processor in each iteration; the total communication complexity is therefore $O(mM)$. Note that if the number of communications $M \le m$, the time and space complexity of DRank-RF-C are the same as those of DRank-RF.
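The iteration above can be sketched end-to-end. The sketch below uses raw inputs in place of the random features $\phi_m$ and our own variable names; it checks that the weighted Newton-Raphson updates drive the global gradient to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, p, lam, M = 800, 4, 4, 0.05, 12
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

parts = np.array_split(np.arange(N), p)
sizes = np.array([len(idx) for idx in parts], dtype=float)
wts = sizes ** 2 / (sizes ** 2).sum()

# Local quantities H_j = Phi_j W_j Phi_j^T + (lam/2) I and c_j = Phi_j W_j ybar_j,
# so the local gradient term is G_j(g) = H_j g - c_j.
H, c = [], []
for idx in parts:
    n = len(idx)
    Wc = np.eye(n) - np.ones((n, n)) / n       # centering matrix W_{D_j}
    Phi = X[idx].T / np.sqrt(n)                # stand-in for Phi_{m,D_j}
    H.append(Phi @ Wc @ Phi.T + lam / 2 * np.eye(d))
    c.append(Phi @ Wc @ y[idx] / np.sqrt(n))

g = np.zeros(d)                                # step 1: initialize to 0
for _ in range(M):
    # steps 2-3: local gradients are averaged into the global gradient
    grad = sum(w * (Hj @ g - cj) for w, Hj, cj in zip(wts, H, c))
    # steps 4-5: beta_j = H_j^{-1} * global gradient, then weighted averaging
    g = g - sum(w * np.linalg.solve(Hj, grad) for w, Hj in zip(wts, H))
```

At the fixed point the averaged gradient vanishes, so the iterate approaches the solution of the weighted-average normal equations; only gradients and parameter vectors cross the processor boundary, never the data.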

4. THEORETICAL ANALYSIS

Here, we analyze the convergence rates of DRank-RF and DRank-RF-C in probability. Define the optimal hypothesis $f_\lambda$ in $\mathcal{H}_K$ as $f_\lambda = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}(f) + \lambda \|f\|_K^2$, and assume that $f_\lambda$ exists.

4.1. CONVERGENCE RATE OF DRANK-RF

In the following, we state and discuss the convergence rate of DRank-RF in probability.

Theorem 1. Suppose $\psi$ is continuous and $|\psi(x,\omega)| \le \tau$ almost surely for some $\tau \in [1, \infty)$. Assume that $L_K^{-r} f_\rho \in \mathcal{H}_K$ with $0 < r \le 1$. Then, for $f^0_{m,D,\lambda}$ defined in Eq. (4) and every $\delta \in (0, 1]$, with confidence at least $1 - \delta$,
$$\|f^0_{m,D,\lambda} - f_\rho\|_K = O\left(\left(\frac{\sum_{j=1}^p |D_j|}{(\sum_{k=1}^p |D_k|)^2}\right)^{\frac{r}{1+r}} \log\frac{1}{\delta}\right).$$

Remark 1. From Theorem 1, one can see that if the number of random features $m$ is $\Omega\big(\big(\sum_{j=1}^p |D_j| / (\sum_{k=1}^p |D_k|)^2\big)^{-2r}\big)$, the convergence rate of the proposed DRank-RF reaches $O\big(\big(\sum_{j=1}^p |D_j| / (\sum_{k=1}^p |D_k|)^2\big)^{\frac{r}{1+r}}\big)$, which is sharper than the rate $O\big(\big(\sum_{j=1}^p |D_j|^{3/2} / (\sum_{k=1}^p |D_k|)^2\big)^{\frac{r}{1+r}}\big)$ of the existing state-of-the-art distributed pairwise ranking kernel learning (Chen et al., 2021). When the number of partitions $p = 1$, the convergence rate of the proposed DRank-RF is $O(|D|^{-\frac{r}{1+r}})$. The theoretical analysis demonstrates that the proposed DRank-RF is sound and effective.

Remark 2. From a theoretical perspective, this paper is a non-trivial extension of these approximate pairwise ranking methods. Existing papers mainly use capacity concentration estimation (Rudin, 2009; Rudin & Schapire, 2009; Rejchel, 2012) and algorithmic stability (Cossock & Zhang, 2008; Chen et al., 2014) for the learning theory analysis of pairwise ranking. In this paper, we apply the integral operator framework and introduce a novel error decomposition technique so that the proposed method achieves a tight bound under a basic condition. The details can be seen in the Appendix. This is the first time that distributed learning and random features have been combined in LSRank with such a breakthrough.

Table 1: The computational complexity of different algorithms. $m$ is the number of random features, $m < |D_j|$; $M$ is the number of communications; $q$ is the dimension of the data; $|D_j| < |D|$.

Algorithms                          Time                    Space       Communication
LSRank (Chen et al., 2019)          |D|^3                   |D|^2       /
DRank (Chen et al., 2021; 2019)     |D_j|^3                 |D_j|^2     |D_j|
DRank-C (Chen et al., 2021)         |D_j|^3 + M|D_j||D|     |D_j|^2     qM|D|
DRank-RF (This Paper)               m^2|D_j|                m|D_j|      m
DRank-RF-C (This Paper)             m^2|D_j| + mM|D_j|      m|D_j|      mM

4.2. CONVERGENCE RATE OF DRANK-RF-C

Here, we introduce and discuss the convergence analysis of DRank-RF-C in probability.

Theorem 2. Suppose $\psi$ is continuous and $|\psi(x,\omega)| \le \tau$ almost surely for some $\tau \in [1, \infty)$. Assume that $L_K^{-r} f_\rho \in \mathcal{H}_K$ with $0 < r \le 1$, where $L_K^r$ is the $r$-th power of $L_K$. If $\lambda = O(|D|^{-\frac{1}{1+r}})$ and $|D_1| = \cdots = |D_p| = \frac{|D|}{p}$, then for every $\delta \in (0, 1]$, with confidence at least $1 - \delta$, we have
$$\|f^M_{m,D,\lambda} - f_\rho\|_K = O\Big(p^{\frac{1}{2}}\, |D|^{-\frac{r(M+2)}{2(1+r)}}\Big).$$

Proof. The proofs of Theorems 1 and 2 are given in the Appendix.

The assumption $L_K^{-r} f_\rho \in \mathcal{H}_K$ with $0 < r \le 1$ is commonly used in approximation theory (Smale & Zhou, 2007) and can be seen as a regularity assumption.

Remark 3. The theoretical analysis shows that when $p < |D|^{\frac{rM}{rM+M+2}}$, the convergence rate of DRank-RF-C is sharper than that of DRank-RF under the same settings. Note that this upper bound on $p$ is monotonically increasing in the number of communications $M$, which demonstrates the power of the proposed communications; as $M \to \infty$, the convergence rate of DRank-RF-C is always sharper than that of DRank-RF. The convergence rate in Theorem 2 also depends on $\delta$; to simplify the presentation we omit it here, and the detailed relationship is shown in Appendix C.2.

5. COMPARED WITH THE RELATED WORKS

In this section, we introduce the related distributed pairwise ranking methods in kernel learning. Chen et al. (2019) proposed the distributed least square ranking (DRank), and Chen et al. (2021) further equipped it with a communication strategy (DRank-C). However, that communication strategy requires communicating the input data between the local processors, so it is difficult to protect the data privacy of each local processor. Furthermore, in each iteration the communication complexity of each local processor is $O(qM|D|)$, where $q$ denotes the data dimension, which is infeasible in practice for large-scale datasets. Table 1 lists the detailed complexity of the related methods. We see that DRank-RF requires lower complexity than the other methods, and DRank-RF-C requires lower complexity than the communication-based method. In addition, the communication strategy proposed in this paper only requires communicating the gradients and the model parameters, rather than the data, so the proposed DRank-RF-C does better at protecting privacy. The convergence rate of the proposed DRank-RF in Theorem 1 is sharper than the rate $O\big(\big(\sum_{j=1}^p |D_j|^{3/2}/(\sum_{k=1}^p |D_k|)^2\big)^{\frac{r}{1+r}}\big)$ of the existing state-of-the-art DRank without communications (Chen et al., 2021; 2019). And the convergence rate of the proposed DRank-RF-C in Theorem 2 is sharper than the rate $O\big(\max\big(p^{\frac{1}{2}}\, |D|^{-\frac{r(M+1)}{2(1+r)}},\; |D|^{-\frac{r}{2(1+r)}}\big)\big)$ of the existing communication-based DRank (Chen et al., 2021).

6. EMPIRICAL EVALUATIONS

We perform experiments to validate our theoretical analysis of DRank-RF and the communication strategy on simulated and real datasets. The server has 32 cores (2.40 GHz) and 32 GB of RAM.

6.1. PARAMETERS AND CRITERION

We use the Gaussian kernel $K(x, x') = \exp\big(-\|x - x'\|_2^2/(2d^2)\big)$. The optimal bandwidth $d \in 2^{[-2:0.5:5]}$ and regularization parameter $\lambda \in 2^{[-13:2:-3]}$ are selected via 5-fold cross-validation. The criterion for evaluating the methods on testing data is (Chen et al., 2021; Kriukova et al., 2016)
$$R(f) = \frac{\sum_{i,j=1}^n I_{\{(y_i > y_j) \wedge (f(x_i) \le f(x_j))\}}}{\sum_{i,j=1}^n I_{\{y_i > y_j\}}},$$
where $I_{\{\varphi\}}$ is 1 if $\varphi$ is true and 0 otherwise. We use the exact LSRank, which trains on all samples in one batch, as a baseline, and compare the proposed DRank-RF and DRank-RF-C ($M = 2, 4, 8$) with DRank, DRank-C, and LSRank under various settings. We repeat the training 5 times and estimate the error on the testing data.

6.2. SIMULATED DATA

The simulated data are generated by $y_i = [\|x_i\|/7] + \epsilon_i$, $1 \le i \le |D|$, where $[\cdot]$ denotes the integer part and $\epsilon_i$ is noise independently sampled from the Gaussian distribution $N(0, 0.01)$. The dimension is $q = 7$. We generate 20000 samples; 70% are used for training and 30% for testing. Figure 1 shows little difference between the testing errors of DRank-RF and DRank, both of which are close to the optimal level. These results are consistent with our theoretical analysis. As the number of random features $m$ increases, the training time of DRank-RF increases and the testing error becomes smaller, in line with the theoretical reasoning. Moreover, the testing error of DRank-RF declines significantly even when $m$ is small; in practice, a small $m$ therefore suffices to obtain a satisfactory error, saving computing resources. Note that DRank and LSRank do not depend on $m$. Figure 2 shows the relation between the testing error, $p$, and different numbers of communications ($M = 2, 4, 8$) with $m = 300$, and indicates the following: 1) As $p$ increases, the testing error gaps between the $p$-related algorithms and exact LSRank become larger and larger.
There exists an upper bound on $p$ for DRank-RF and for DRank-RF-C; when $p$ exceeds it, the testing error increases and moves away from that of the exact LSRank. This is in line with Theorems 1 and 2. 2) The upper bound on $p$ for DRank-RF-C is much larger than that for DRank-RF, which is aligned with our theoretical analysis that the bound on $p$ is determined by the number of communications. 3) Under the same $p$, DRank-RF-C performs better than DRank-RF, and the testing error of DRank-RF-C decreases as the number of communications $M$ increases. These observations verify the power of the communication strategy for DRank-RF. 4) Under the same conditions, the testing errors of the proposed DRank-RF and DRank-RF-C are similar to those of DRank and DRank-C.
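The evaluation criterion $R(f)$ defined in Section 6.1 can be sketched as a vectorized function (our own implementation of the formula):

```python
import numpy as np

def ranking_error(y_true, y_pred):
    """R(f): fraction of correctly ordered ground-truth pairs (y_i > y_j)
    that the predictor mis-orders (f(x_i) <= f(x_j))."""
    gt = y_true[:, None] > y_true[None, :]    # indicator {y_i > y_j}
    bad = y_pred[:, None] <= y_pred[None, :]  # indicator {f(x_i) <= f(x_j)}
    return (gt & bad).sum() / gt.sum()
```

For example, with true scores [3, 1, 2], predictions [30, 10, 20] preserve the ordering (error 0), while [20, 30, 10] mis-order two of the three ordered pairs (error 2/3).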

6.3. REAL DATA

The real dataset MovieLens is from http://www.grouplens.org/taxonomy/term/14; it is a $62423 \times 162541$ rating matrix whose $(i, j)$ entry is the rating score of the $j$-th reviewer on the $i$-th movie. We group the reviewers who have rated 500-1000 movies, and 500 reference reviewers are selected at random from this group. In addition, we select the test reviewers from the users who have rated more than 5000 movies. We thus obtain a matrix of size at least $5000 \times 501$, where the last column corresponds to the test reviewer and the other columns correspond to the 500 reference reviewers. Then the columns without non-zero elements are deleted, as are the rows without a rating from any reference reviewer or from the test reviewer. Finally, each missing rating of a remaining movie is replaced by the median rating of the remaining reference reviewers on that movie. This yields a smaller matrix in which each row is a data pair $(x_i, y_i)$, with the last entry being the label $y_i$ of the input features $x_i$. The experimental setup is similar to that of Chen et al. (2021). On the obtained dataset, 70% is used for training and 30% for testing. The empirical evaluations are given in Table 2 for $m = 100, 150$ and $p = 2, 10, 15$. From Table 2, the experimental results are similar to those on the simulated data: the average testing error gaps between our methods and the exact methods are particularly small, which verifies the effectiveness of our methods on the real dataset.
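The preprocessing described above (row filtering and row-wise median imputation) can be sketched on a toy rating matrix; the 0-as-missing encoding and all values are illustrative:

```python
import numpy as np

# Toy rating matrix: rows = movies, columns = 4 reference reviewers + 1 test reviewer.
# A 0 entry denotes a missing rating (illustrative encoding).
R = np.array([[5., 0., 4., 3., 4.],
              [0., 0., 0., 0., 0.],   # no ratings at all -> row dropped
              [3., 2., 0., 4., 5.],
              [0., 1., 2., 0., 0.]])  # not rated by the test reviewer -> dropped

refs, test = R[:, :-1], R[:, -1]
# Keep rows rated by the test reviewer and by at least one reference reviewer.
keep = (test > 0) & (refs > 0).any(axis=1)
refs, test = refs[keep], test[keep]

# Replace each missing reference rating with the median of the observed ones
# for that movie (row).
for i in range(refs.shape[0]):
    observed = refs[i][refs[i] > 0]
    refs[i][refs[i] == 0] = np.median(observed)

X, y = refs, test  # each row is a data pair (x_i, y_i)
```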

7. CONCLUSIONS

We propose a novel pairwise ranking method (DRank-RF) that scales to large-scale scenarios. Our work is the first to apply random features to least square ranking, a new exploration of the application of random features. Our theoretical analysis, based on integral operator techniques, shows that its convergence rate is sharper than that of the existing state-of-the-art DRank without communications. In terms of computational complexity, the proposed methods remarkably reduce the time, space, and communication requirements. Furthermore, we propose an effective communication strategy (DRank-RF-C) that further improves the learning performance of DRank-RF, and we demonstrate the power of communications through theoretical assessments and numerical experiments.

A PRELIMINARY DEFINITIONS

There is a compact metric space $Z := \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{q+1}$, where $\mathcal{X} \subset \mathbb{R}^q$ and $\mathcal{Y} \subset [-b, b]$ for some positive constant $b$. The sample set $D := \{(x_i, y_i)\}_{i=1}^N$ of size $N = |D|$ is drawn independently from an intrinsic Borel probability measure $\rho$ on $Z$; $\rho(y|X = x)$ denotes the conditional distribution for a given input $x$. The hypothesis space is the RKHS $\mathcal{H}_K$ associated with a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (Aronszajn, 1950; Cucker & Zhou, 2007). We denote the inner product in $\mathcal{H}_K$ by $\langle \cdot, \cdot \rangle$, the corresponding norm by $\|\cdot\|_K$, and write $K_x = K(x, \cdot)$. Let $\rho_X$ be the marginal distribution of $\rho$ on $\mathcal{X}$ and $L^2_{\rho_X}$ the Hilbert space of $\rho_X$-square-integrable functions on $\mathcal{X}$. The Mercer kernel $K$ defines an integral operator $L_K$ on $\mathcal{H}_K$ (or $L^2_{\rho_X}$) (Chen et al., 2021) by
$$L_K f = \int_{\mathcal{X}} \int_{\mathcal{X}} f(x)\,\big(K_x - K_{x'}\big)\, d\rho_X(x)\, d\rho_X(x').$$
Suppose $\psi$ is continuous and $|\psi(x,\omega)| \le \tau$ almost surely, $\tau \in [1, \infty)$. Assume that $L_K^{-r} f_\rho \in \mathcal{H}_K$ with $0 < r \le 1$, where $L_K^r$ is the $r$-th power of $L_K$. Before the proofs, we give some definitions:
$$S_m : \mathbb{R}^m \to L^2_{\rho_X}, \quad (S_m g)(x) = \langle g, \phi_m(x)\rangle,$$
$$S_m^* : L^2_{\rho_X} \to \mathbb{R}^m, \quad S_m^* f = \int_{\mathcal{X}} \phi_m(x) f(x)\, d\rho_X(x),$$
$$S_{m,D}^* : L^2_{\rho_X} \to \mathbb{R}^m, \quad S_{m,D}^* f = \frac{1}{|D|} \sum_{x_j \in D} \phi_m(x_j) f(x_j).$$
$S_m^* S_m$ and $\Phi_{m,D} \Phi_{m,D}^T = S_{m,D}^* S_m$ are self-adjoint positive operators with spectrum in $[0, \tau^2]$ (Caponnetto & Vito, 2007). This part is organized as follows: Section B presents the proof of Theorem 1; Section B.1 contains the main lemmas used in the proofs of Theorems 1 and 2, and Section B.2 gives the detailed proof of Theorem 1. Section C presents the proof of Theorem 2; Section C.1 contains the main lemmas used for Theorem 2, and Section C.2 gives its detailed proof. Section D presents the propositions used in the proofs of Theorems 1 and 2, and Section E reports the experiments on the Jester Joke dataset.

B PROOF OF THEOREM 1

B.1 BOUND TERMS Lemma 1. We have √ λ g m,D,λ -g m,λ ≤ √ 2 Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 * S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) + S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) + 1 + Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 f m,λ -f ρ K . Proof. Note that g m,D,λ -g m,λ ≤ g m,D,λ -g m,D,λ + g m,D,λ -g m,λ . Define f m,D,λ = g T m,D,λ φ m (•), g m,D,λ = arg min g∈R m 1 |D| 2 zi,z k ∈D (g T φ m (x i ) -y i ) -(g T φ m (x k ) -y k ) 2 + λ g 2 , f m,D,λ = g T m,D,λ φ m (•), g m,D,λ = arg min g∈R m 1 |D| 2 zi,z k ∈D (g T φ m (x i ) -f ρ (x i )) -(g T φ m (x k ) -f ρ (x k )) 2 + λ g 2 . One can have f m,D,λ = S m g m,D,λ , g m,D,λ = Φ m,D W D Φ T m,D + λ 2 I -1 Φ m,D W D ȳD , f m,D,λ = S m g m,D,λ , and g m,D,λ = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ , so we have g m,D,λ -g m,D,λ = Φ m,D W D Φ T m,D + λ 2 I -1 Φ m,D W D ȳD -S * m,D W D f ρ = Φ m,D W D Φ T m,D + λ 2 I -1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 * S * m S m + λ 2 I -1/2 Φ m,D W D ȳD -S * m,D W D f ρ . (10) Note that Φ m,D W D Φ T m,D + λ 2 I -1/2 ≤ 2/λ. Thus we can obtain that g m,D,λ -g m,D,λ ≤ 2/λ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 * S * m S m + λ 2 I -1/2 Φ m,D W D ȳD -S * m,D W D f ρ = 2/λ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 * S * m S m + λ 2 I -1/2 Φ m,D W D ȳD -S * m W D f ρ + S * m W D f ρ -S * m,D W D f ρ ≤ 2/λ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 * S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) + S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) . (11) Define f m,λ = g T m,λ φ m (•) with g m,λ = arg min g∈R m Z Z (g T φ m (x) -f ρ (x)) -(g T φ m (x ) -f ρ (x )) 2 dρ X (x, y)dρ X (x , y ) + λ g 2 . We know f m,λ = S m g m,λ and g m,λ = S * m W D S m + λ 2 I -1 S * m W D f ρ . 
So one can obtain g m,D,λ -g m,λ = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ -S * m W D S m + λ 2 I -1 S * m W D f ρ = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ -S * m W D f ρ + Φ m,D W D Φ T m,D + λ 2 I -1 -S * m W D S m + λ 2 I -1 S * m W D f ρ . For any self-adjoint and positive operators A and B, A -1 -B -1 = A -1 (B -A)B -1 , A -1 -B -1 = B -1 (B -A)A -1 , so we have g m,D,λ -g m,λ = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ -S * m W D f ρ + Φ m,D W D Φ T m,D + λ 2 I -1 S * m W D S m -Φ m,D W D Φ T m,D g m,λ < Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ -S * m W D f ρ + Φ m,D W D Φ T m,D + λ 2 I -1 S * m S m -Φ m,D Φ T m,D g m,λ . We know that Φ m,D Φ T m,D = S * m,D S m (Caponnetto & Vito, 2007) , thus we can obtain that g m,D,λ -g m,λ < Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D W D f ρ -S * m W D f ρ + Φ m,D W D Φ T m,D + λ 2 I -1 S * m S m g m,λ -S * m,D S m g m,λ ≤ Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D f ρ -S * m,D S m g m,λ + Φ m,D W D Φ T m,D + λ 2 I -1 [S * m S m g m,λ -S * m f ρ ] = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D f ρ -S * m,D f m,λ + Φ m,D W D Φ T m,D + λ 2 I -1 [S * m f m,λ -S * m f ρ ] = Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D [f ρ -f m,λ ] + Φ m,D W D Φ T m,D + λ 2 I -1 S * m [f m,λ -f ρ ] . Thus, we have g m,D,λ -g m,λ ≤ Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D + Φ m,D W D Φ T m,D + λ 2 I -1 S * m f m,λ -f ρ K . Note that Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m,D ≤ Φ m,D W D Φ T m,D + λ 2 I -1/2 Φ m,D W D Φ T m,D Φ m,D W D Φ T m,D + λ 2 I -1/2 1/2 ≤ 1 and Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m = Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 S * m S m + λ 2 I -1/2 S * m ≤ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 S * m S m + λ 2 I -1/2 S * m , since S * m S m + λ 2 I -1/2 S * m = S * m S m + λ 2 I -1/2 S * m S m S * m S m + λ 2 I -1/2 1/2 ≤ 1. Substituting the above two inequalities into Eq.( 13) we have g m,D,λ -g m,λ ≤ 1 √ λ 1 + Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 f m,λ -f ρ K . 
( ) Combining Eq.( 11) and Eq.( 14), we finish this proof. Lemma 2. We have f m,D,λ -f m,λ K ≤ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 2 * S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) + S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) +   Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 + Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 2   * f m,λ -f ρ K . Proof. Note that f m,D,λ -f m,λ K ≤ f m,D,λ -f m,D,λ K + f m,D,λ -f m,λ K . According to f m,D,λ -f m,D,λ = S m g m,D,λ -g m,D,λ , by Eq.( 10), we have f m,D,λ -f m,D,λ = S m g m,D,λ -g m,D,λ =S m S * m S m + λ 2 I -1/2 S * m S m + λ 2 I 1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 * S * m S m + λ 2 I 1/2 S * m S m + λ 2 I -1/2 Φ m,D W D ȳD -S * m W D f ρ + S * m W D f ρ -S * m,D W D f ρ . ( ) Note that S m S * m S m + λ 2 I -1/2 = S * m S m + λ 2 I -1/2 S * m S m S * m S m + λ 2 I -1/2 1/2 = S * m S m + λ 2 I -1/2 S * m S m S * m S m + λ 2 I -1/2 1/2 ≤ 1. So, by Eq.( 15) we have f m,D,λ -f m,D,λ K ≤ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 2 * S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) + S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) . ( ) Similarly, according to Eq.( 12), we have f m,D,λ -f m,λ = S m g m,D,λ -g m,λ ≤S m Φ m,D W D Φ T m,D + λ 2 I -1 S * m,D [f ρ -f m,λ ] + S m Φ m,D W D Φ T m,D + λ 2 I -1 S * m [f m,λ -f ρ ] =S m S * m S m + λ 2 I -1/2 S * m S m + λ 2 I 1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 * S * m,D [f ρ -f m,λ ] + S m S * m S m + λ 2 I -1/2 S * m S m + λ 2 I 1/2 Φ m,D W D Φ T m,D + λ 2 I -1/2 * Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 S * m S m + λ 2 I -1/2 S * m [f m,λ -f ρ ] . Note that S m S * m S m + λ 2 I -1/2 = S * m S m + λ 2 I -1/2 S * m S m S * m S m + λ 2 I -1/2 1/2 ≤ 1, so we have f m,D,λ -f m,λ K ≤ Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 + Φ m,D W D Φ T m,D + λ 2 I -1/2 S * m S m + λ 2 I 1/2 2   f m,λ -f ρ K . 
( ) Combining Eq.( 16) and Eq.( 17), we finish this proof. Lemma 3. For δ ∈ (0, 1], with probability at least 1 -δ, we have S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) = O 1 √ λ|D| + N m (λ) |D| log 1 δ , where N m (λ) = Tr L m + λ 2 I -1 L m , L m is the integral operator associated with the approx- imate kernel function K m , (L m f ) (x) = X K m (x, x ) f (x ) dρ X (x ). Proof. We have S * m S m + λ 2 I -1/2 (Φ m,D W D ȳD -S * m W D f ρ ) ≤ S * m S m + λ 2 I -1/2 (Φ m,D ȳD -S * m f ρ ) . According to Lemma 6 in Rudi & Rosasco (2017) , we know, with probability at least 1 -δ, S * m S m + λ 2 I -1/2 (Φ m,D ȳD -S * m f ρ ) = O 1 √ λ|D| + N m (λ) |D| log 1 δ . where N m (λ) = Tr L m + λ 2 I -1 L m , L m is the integral operator associated with the approx- imate kernel function K m , (L m f ) (x) = X K m (x, x ) f (x ) dρ X (x ) . Thus, we complete this proof. Lemma 4. For δ ∈ (0, 1], with probability at least 1 -δ, we have S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) ≤ τ ζ log 1 δ |D| √ λ + 2ζ N m (λ) |D| , where N m (λ) = Tr L m + λ 2 I -1 L m . Proof. We have S * m S m + λ 2 I -1/2 (S * m W D f ρ -S * m,D W D f ρ ) ≤ S * m S m + λ 2 I -1/2 S * m f ρ -S * m,D f ρ . According to Proposition 5 in Liu et al. (2021) , with probability at least 1 -δ, we have S * m S m + λ 2 I -1/2 S * m f ρ -S * m,D f ρ ≤ τ ζ log 1 δ |D| √ λ + 2ζ N m (λ) |D| , where N m (λ) = Tr L m + λ 2 I -1 L m . Combining them, we complete this proof. Lemma 5. For any δ > 0, with probability at least 1 -δ, we have S * m S m + λ 2 I -1 S * m S m -Φ m,D W D Φ T m,D = S * m S m + λ 2 I -1/2 S * m S m -Φ m,D W D Φ T m,D S * m S m + λ 2 I -1/2 ≤ 2 log 2 (2/δ) 2τ 2 λ -1 + 1 |D| + 2 log(2/δ) (2τ 2 λ -1 + 1) |D| . Proof. Since S * m S m is self-adjoint operator, so we have S * m S m + λ 2 I -1 S * m S m -Φ m,D W D Φ T m,D = S * m S m + λ 2 I -1/2 S * m S m -Φ m,D W D Φ T m,D S * m S m + λ 2 I -1/2 . 
According to Proposition 1 with $\zeta_i=\varphi_m(x_i)$, we can obtain
\[
\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)\big\|
\le\frac{2\log^{2}(2/\delta)\big(N_\infty(\lambda)+1\big)}{|D|}+\sqrt{\frac{2\log(2/\delta)\big(N_\infty(\lambda)+1\big)}{|D|}},
\]
where $N_\infty(\lambda)=\sup_{\omega\in\Omega}\big\|\big(L_K+\lambda^{2}I\big)^{-1/2}\psi(\cdot,\omega)\big\|_K^{2}\le2\tau^{2}\lambda^{-1}$ (Rudi & Rosasco, 2017) and $L_Kf=\int_XK(x,\cdot)f(x)\,d\rho_X(x)$. Therefore, we have
\[
\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)\big\|
\le\frac{2\log^{2}(2/\delta)\big(2\tau^{2}\lambda^{-1}+1\big)}{|D|}+\sqrt{\frac{2\log(2/\delta)\big(2\tau^{2}\lambda^{-1}+1\big)}{|D|}}.
\]

Lemma 6. We have
\[
\begin{aligned}
\big\|\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}\big\|
\le{}&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)^{1/2}\big\|^{2}\\
&\times\Big(\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\\
&\quad+\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\Big)
\big\|g_{m,D_j,\lambda}-g_{m,\lambda}\big\|.
\end{aligned}
\]

Proof. For brevity, write $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$ and $A_{D_j}:=\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$. Note that $g_{m,D,\lambda}=(A_D+\lambda^{2}I)^{-1}\Phi_{m,D}W_D\bar y_D$. Thus we have
\[
\begin{aligned}
\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\Big(\big(A_{D_j}+\lambda^{2}I\big)^{-1}\Phi_{m,D_j}W_{D_j}\bar y_{D_j}-\big(A_D+\lambda^{2}I\big)^{-1}\Phi_{m,D}W_D\bar y_D\Big)\\
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\Big(\big(A_{D_j}+\lambda^{2}I\big)^{-1}-\big(A_D+\lambda^{2}I\big)^{-1}\Big)\Phi_{m,D_j}W_{D_j}\bar y_{D_j}\\
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-A_{D_j}\big)\big(A_{D_j}+\lambda^{2}I\big)^{-1}\Phi_{m,D_j}W_{D_j}\bar y_{D_j}\\
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-A_{D_j}\big)\,g_{m,D_j,\lambda}.
\end{aligned}
\]
By introducing the $S_m^{*}S_m$ term, we can convert the above formula into
\[
\begin{aligned}
\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}
={}&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-S_m^{*}S_m\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)\\
&+\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-S_m^{*}S_m\big)g_{m,\lambda}\\
&+\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(A_D+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-A_{D_j}\big)g_{m,D_j,\lambda}.
\end{aligned}
\]
So we have
\[
\begin{aligned}
\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}
={}&\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-S_m^{*}S_m\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)}_{\text{Term-A}}\\
&+\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(A_D+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-A_{D_j}\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)}_{\text{Term-B}},
\end{aligned}\tag{21}
\]
where $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$ and $A_{D_j}:=\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$. Note that, with $T_\lambda:=S_m^{*}S_m+\lambda^{2}I$,
\[
\text{Term-A}=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_D+\lambda^{2}I\big)^{-1}T_\lambda\cdot T_\lambda^{-1}\big(A_D-S_m^{*}S_m\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)
\]
and
\[
\text{Term-B}=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_D+\lambda^{2}I\big)^{-1}T_\lambda\cdot T_\lambda^{-1}\big(S_m^{*}S_m-A_{D_j}\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big).
\]
Substituting the above equations into Eq.(21) and bounding each factor in operator norm yields the stated result. Here, we complete this proof.

Lemma 7. We have
\[
\begin{aligned}
\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K
\le{}&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\Big(\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|\\
&+\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_{D_j}\big)T_\lambda^{-1/2}\big\|\Big)
\Big(\big\|f_{m,D_j,\lambda}-f_{m,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|g_{m,D_j,\lambda}-g_{m,\lambda}\big\|\Big).
\end{aligned}
\]

Proof. Note that $S_m\big(\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}\big)=\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}$. According to Eq.(21), we have
\[
\begin{aligned}
\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}
={}&\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}S_m\big(A_D+\lambda^{2}I\big)^{-1}\big(A_D-S_m^{*}S_m\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)}_{\text{Term-A}}\\
&+\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}S_m\big(A_D+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-A_{D_j}\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)}_{\text{Term-B}}.
\end{aligned}\tag{23}
\]
Note that
\[
\begin{aligned}
\text{Term-A}
={}&S_mT_\lambda^{-1/2}\cdot T_\lambda^{1/2}\big(A_D+\lambda^{2}I\big)^{-1/2}\cdot\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\\
&\times T_\lambda^{-1/2}\big(A_D-S_m^{*}S_m\big)T_\lambda^{-1/2}\cdot T_\lambda^{-1/2}T_\lambda\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big).
\end{aligned}
\]

So, we have

\[
\begin{aligned}
\|\text{Term-A}\|_K
\le{}&\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\,\big\|S_mT_\lambda^{-1/2}\big\|\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)\big\|\\
\le{}&\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)\big\|,
\end{aligned}
\]
since $\|S_mT_\lambda^{-1/2}\|=\|T_\lambda^{-1/2}S_m^{*}S_mT_\lambda^{-1/2}\|^{1/2}\le1$, where $T_\lambda:=S_m^{*}S_m+\lambda^{2}I$ and $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$. Splitting $S_m^{*}S_m+\lambda^{2}I$ into $S_m^{*}S_m$ and $\lambda^{2}I$ gives
\[
\begin{aligned}
\|\text{Term-A}\|_K
\le{}&\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\,\big\|T_\lambda^{-1/2}S_m^{*}S_m\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)\big\|\\
&+\lambda^{2}\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\,\big\|T_\lambda^{-1/2}\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)\big\|\\
\le{}&\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\Big(\big\|f_{m,D_j,\lambda}-f_{m,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|g_{m,D_j,\lambda}-g_{m,\lambda}\big\|\Big),
\end{aligned}\tag{24}
\]
where the last inequality uses $S_m^{*}S_m\big(g_{m,D_j,\lambda}-g_{m,\lambda}\big)=S_m^{*}\big(f_{m,D_j,\lambda}-f_{m,\lambda}\big)$ and the fact that
\[
\big\|T_\lambda^{-1/2}S_m^{*}\big\|=\big\|T_\lambda^{-1/2}S_m^{*}S_mT_\lambda^{-1/2}\big\|^{1/2}\le1.
\]
Similar to the above process, we can obtain
\[
\|\text{Term-B}\|_K
\le\big\|\big(A_D+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\,\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_{D_j}\big)T_\lambda^{-1/2}\big\|
\Big(\big\|f_{m,D_j,\lambda}-f_{m,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|g_{m,D_j,\lambda}-g_{m,\lambda}\big\|\Big),\tag{25}
\]
where $T_\lambda:=S_m^{*}S_m+\lambda^{2}I$, $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$, and $A_{D_j}:=\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$. Combining Eq.(23), Eq.(24), and Eq.(25), we obtain this result.

Lemma 8. For $\delta\in(0,1]$ and $\lambda>0$, when $m=\Omega\big(\big(\lambda^{-2r}\vee\lambda^{-1}\big)\log\frac{1}{\lambda\delta}\big)$, with probability at least $1-\delta$, we have
\[
\|f_{m,\lambda}-f_\lambda\|_K\le c\lambda^{r},
\]
where $c$ is a constant.

Proof. Note that $f_{m,\lambda}=S_mg_{m,\lambda}$ and $g_{m,\lambda}=\big(S_m^{*}W_DS_m+\lambda^{2}I\big)^{-1}S_m^{*}W_Df_\rho$. We have
\[
\|f_{m,\lambda}-f_\lambda\|_K
=\big\|S_m\big(S_m^{*}W_DS_m+\lambda^{2}I\big)^{-1}S_m^{*}W_Df_\rho-f_\lambda\big\|_K
\le\big\|S_m\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1}S_m^{*}f_\rho-\hat f_\lambda\big\|_K,
\]
where $\hat f_\lambda=\arg\min_{f\in\mathcal H_K}\big\{\int_X(f(x)-f_\rho(x))^{2}\,d\rho_X(x)+\lambda\|f\|_K^{2}\big\}$. According to Lemma 2 in Liu et al. (2021) (see also Li et al. (2019) and Rudi & Rosasco (2017)), when $m=\Omega\big(\big(\lambda^{-2r}\vee\lambda^{-1}\big)\log\frac{1}{\lambda\delta}\big)$, with probability at least $1-\delta$,
\[
\big\|S_m\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1}S_m^{*}f_\rho-\hat f_\lambda\big\|_K\le c\lambda^{r}.
\]
Combining the above, we complete this proof.

B.2 PROOF OF THEOREM 1

Proof. We have
\[
\begin{aligned}
\big\|\bar f^{\,0}_{m,D,\lambda}-f_\rho\big\|_K
&=\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}+f_{m,D,\lambda}-f_{m,\lambda}+f_{m,\lambda}-f_\lambda+f_\lambda-f_\rho\big\|_K\\
&\le\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\big\|f_{m,D,\lambda}-f_{m,\lambda}\big\|_K+\big\|f_{m,\lambda}-f_\lambda\big\|_K+\big\|f_\lambda-f_\rho\big\|_K.
\end{aligned}\tag{26}
\]
Combining Lemma 1, Lemma 2, and Lemma 7, we have
\[
\begin{aligned}
\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K
\le{}&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|^{2}\\
&\times\Big(\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)SS_\lambda^{-1/2}\big\|
+\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|\Big)\\
&\times\bigg[\Big(\sqrt2\,\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|
+\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|^{2}\Big)\\
&\qquad\times\Big(\big\|SS_\lambda^{-1/2}\big(\Phi_{m,D_j}W_{D_j}\bar y_{D_j}-S_m^{*}W_{D_j}f_\rho\big)\big\|
+\big\|SS_\lambda^{-1/2}\big(S_m^{*}W_{D_j}f_\rho-S_{m,D_j}^{*}W_{D_j}f_\rho\big)\big\|\Big)\\
&\qquad+\Big(2\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|
+\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|^{2}+1\Big)
\big\|f_{m,\lambda}-f_\rho\big\|_K\bigg],
\end{aligned}
\]
where $SS_\lambda=S_m^{*}S_m+\lambda^{2}I$. From Lemma 5, we know that if $|D|\ge32\log(2/\delta)\big(1+2\tau^{2}\lambda^{-1}\big)$,
\[
\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}-S_m^{*}S_m\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\le\frac12.
\]
Combining the above inequality and Proposition 2, for any $\delta>0$, with probability at least $1-\delta$, we can obtain
\[
\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)^{1/2}\big\|\le\sqrt2.\tag{27}
\]
From Lemma 2, we have
\[
\begin{aligned}
\big\|f_{m,D,\lambda}-f_{m,\lambda}\big\|_K
\le{}&\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|^{2}
\Big(\big\|SS_\lambda^{-1/2}\big(\Phi_{m,D}W_D\bar y_D-S_m^{*}W_Df_\rho\big)\big\|\\
&+\big\|SS_\lambda^{-1/2}\big(S_m^{*}W_Df_\rho-S_{m,D}^{*}W_Df_\rho\big)\big\|\Big)\\
&+\Big(\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|
+\big\|\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1/2}SS_\lambda^{1/2}\big\|^{2}\Big)\big\|f_{m,\lambda}-f_\rho\big\|_K.
\end{aligned}\tag{28}
\]
From Proposition 3, Lemma 3, Lemma 4, and Eq.(27), we know that if $|D|\ge\Omega\big(\tau^{2}\lambda^{-1}\big)$, then
\[
\big\|f_{m,D,\lambda}-f_{m,\lambda}\big\|_K=O\Big(\Upsilon_{m,D,\lambda}\log\frac{1}{\delta}+\|f_{m,\lambda}-f_\lambda\|_K+\|f_\lambda-f_\rho\|_K\Big),\tag{29}
\]
where $\Upsilon_{m,D,\lambda}=\frac{1}{\sqrt{\lambda}\,|D|}+\sqrt{\frac{N_m(\lambda)}{|D|}}$. Note that
\[
\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)SS_\lambda^{-1/2}\big\|
\le\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|.
\]
According to Proposition 4 and Lemma 8, we have
\[
\begin{aligned}
\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K
=O\bigg(&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|\,\Upsilon_{m,D_j,\lambda}\log\frac{1}{\delta}\\
&+\lambda^{r}\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|\bigg),
\end{aligned}\tag{30}
\]
where $SS_\lambda=S_m^{*}S_m+\lambda^{2}I$. Combining Eq.(26), Eq.(28), Eq.(30), Proposition 4, and Lemma 8, one can obtain that if $m=\Omega\big(\big(\lambda^{-2r}\vee\lambda^{-1}\big)\log\frac{1}{\lambda\delta}\big)$, then with probability $1-\delta$,
\[
\begin{aligned}
\big\|\bar f^{\,0}_{m,D,\lambda}-f_\rho\big\|_K
=O\bigg(&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|\,\Upsilon_{m,D_j,\lambda}\log\frac{1}{\delta}
+\Upsilon_{m,D,\lambda}\log\frac{1}{\delta}\\
&+\lambda^{r}\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)SS_\lambda^{-1/2}\big\|+\lambda^{r}\bigg).
\end{aligned}
\]
According to Lemma 5, we have
\[
\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)SS_\lambda^{-1/2}\big\|
\le\frac{2\log^{2}(2/\delta)\big(2\tau^{2}\lambda^{-1}+1\big)}{|D|}+\sqrt{\frac{2\log(2/\delta)\big(2\tau^{2}\lambda^{-1}+1\big)}{|D|}}.\tag{31}
\]
Set $\lambda=O\Big(\big(\sum_{j=1}^{p}|D_j|\big/\sum_{k=1}^{p}|D_k|^{2}\big)^{\frac{1}{1+r}}\Big)$; then the number of random features is
\[
m=\Omega\bigg(\bigg(\frac{\sum_{j=1}^{p}|D_j|}{\sum_{k=1}^{p}|D_k|^{2}}\bigg)^{-\frac{2r}{1+r}}\bigg).\tag{32}
\]
Combining Eq.(31), Eq.(29), and Eq.(32), we have
\[
\big\|\bar f^{\,0}_{m,D,\lambda}-f_\rho\big\|_K
=O\bigg(\bigg(\frac{\sum_{j=1}^{p}|D_j|}{\sum_{k=1}^{p}|D_k|^{2}}\bigg)^{\frac{r}{1+r}}\log\frac{1}{\delta}\bigg).
\]
We complete this proof.

C PROOF OF THEOREM 2

C.1 BOUND TERMS

Lemma 9. We have
\[
\big\|\bar f^{\,l}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K
\le\bigg(\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}J_m\bigg)^{l}
\Big(\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big),
\]
where
\[
\begin{aligned}
J_m={}&2\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)^{1/2}\big\|^{2}
\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\\
&+2\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)^{1/2}\big\|^{2}
\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|.
\end{aligned}
\]

Proof. Note that
\[
\begin{aligned}
g_{m,D,\lambda}&=\bar g^{\,l-1}_{m,D,\lambda}-\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1}\Big(\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)\bar g^{\,l-1}_{m,D,\lambda}-\Phi_{m,D}W_D\bar y_D\Big),\\
\bar g^{\,l}_{m,D,\lambda}&=\bar g^{\,l-1}_{m,D,\lambda}-\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1}\Big(\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)\bar g^{\,l-1}_{m,D,\lambda}-\Phi_{m,D}W_D\bar y_D\Big).
\end{aligned}
\]
Thus, we have
\[
\begin{aligned}
g_{m,D,\lambda}-\bar g^{\,l}_{m,D,\lambda}
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\Big(\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1}-\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)^{-1}\Big)\\
&\qquad\times\Big(\big(\Phi_{m,D}W_D\Phi_{m,D}^{T}+\lambda^{2}I\big)\bar g^{\,l-1}_{m,D,\lambda}-\Phi_{m,D}W_D\bar y_D\Big).
\end{aligned}
\]
The above can be converted into
\[
\begin{aligned}
g_{m,D,\lambda}-\bar g^{\,l}_{m,D,\lambda}
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_{D_j}+\lambda^{2}I\big)^{-1}\big(A_D-A_{D_j}\big)\big(A_D+\lambda^{2}I\big)^{-1}\Big(\big(A_D+\lambda^{2}I\big)\bar g^{\,l-1}_{m,D,\lambda}-\Phi_{m,D}W_D\bar y_D\Big)\\
&=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_{D_j}+\lambda^{2}I\big)^{-1}\big(A_D-A_{D_j}\big)\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)\\
&=\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_{D_j}+\lambda^{2}I\big)^{-1}\big(A_D-S_m^{*}S_m\big)\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)}_{\text{Term-A}}\\
&\quad+\underbrace{\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big(A_{D_j}+\lambda^{2}I\big)^{-1}\big(S_m^{*}S_m-A_{D_j}\big)\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)}_{\text{Term-B}},
\end{aligned}
\]
where $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$ and $A_{D_j}:=\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$. Writing $T_\lambda:=S_m^{*}S_m+\lambda^{2}I$ and using $\|S_mT_\lambda^{-1/2}\|=\|T_\lambda^{-1/2}S_m^{*}S_mT_\lambda^{-1/2}\|^{1/2}\le1$, we have, for each summand,
\[
\big\|S_m\cdot\text{Term-A}\big\|_K
\le\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)\big\|.\tag{35}
\]
Note that $S_m^{*}S_m\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)=S_m^{*}\big(\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big)$.
Substituting the above into Eq.(35), we have
\[
\begin{aligned}
\big\|S_m\cdot\text{Term-A}\big\|_K
\le{}&\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\big\|T_\lambda^{-1/2}S_m^{*}\big(\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big)\big\|\\
&+\lambda^{2}\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\big\|T_\lambda^{-1/2}\big(\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big)\big\|\\
\le{}&\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|
\Big(\big\|\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big),
\end{aligned}
\]
where the last inequality uses $\big\|T_\lambda^{-1/2}S_m^{*}\big\|=\big\|T_\lambda^{-1/2}S_m^{*}S_mT_\lambda^{-1/2}\big\|^{1/2}\le1$, with $T_\lambda:=S_m^{*}S_m+\lambda^{2}I$, $A_D:=\Phi_{m,D}W_D\Phi_{m,D}^{T}$, and $A_{D_j}:=\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$. Using the same process, we can obtain
\[
\big\|S_m\cdot\text{Term-B}\big\|_K
\le\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_{D_j}\big)T_\lambda^{-1/2}\big\|
\Big(\big\|\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big).
\]
Thus, we have
\[
\begin{aligned}
\big\|f_{m,D,\lambda}-\bar f^{\,l}_{m,D,\lambda}\big\|_K
&=\big\|S_m\big(g_{m,D,\lambda}-\bar g^{\,l}_{m,D,\lambda}\big)\big\|_K
\le\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\Big(\big\|S_m\cdot\text{Term-A}\big\|_K+\big\|S_m\cdot\text{Term-B}\big\|_K\Big)\\
&\le\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|\big(A_{D_j}+\lambda^{2}I\big)^{-1/2}T_\lambda^{1/2}\big\|^{2}
\Big(\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_D\big)T_\lambda^{-1/2}\big\|\\
&\qquad+\big\|T_\lambda^{-1/2}\big(S_m^{*}S_m-A_{D_j}\big)T_\lambda^{-1/2}\big\|\Big)
\Big(\big\|\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big).
\end{aligned}\tag{36}
\]



Logarithmic factors in the convergence rates and complexities are omitted throughout this paper.



In time complexity, solving the inverse of $\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I$ needs $O(m^{3})$ time and computing the matrix product $\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}$ requires $O(m^{2}|D_j|)$ time, where $m$ is the number of random features. In space complexity, the key is to store $\Phi_{m,D_j}$, which requires $O(m|D_j|)$ memory. Therefore, the time complexity, space complexity, and communication complexity of DRank-RF for each local processor are $O(m^{2}|D_j|)$, $O(m|D_j|)$, and $O(m)$, respectively, where $m<|D_j|$. Note that the computational cost of constructing the random features model is far less than $m^{2}|D_j|$, so it is ignored when expressing the computational complexity. In the experiments, the training time of our methods includes the time of computing the random features model.
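To make this cost accounting concrete, here is a minimal sketch of one local processor's computation, assuming a random Fourier feature map and plain unweighted least squares (the pairwise weighting $W_{D_j}$ from the paper is omitted for brevity, and all names are illustrative): forming the $m\times m$ Gram matrix is the $O(m^{2}|D_j|)$ term, storing $\Phi_{m,D_j}$ is the $O(m|D_j|)$ memory term, and the regularized solve is the $O(m^{3})$ term.

```python
import numpy as np

def local_rf_estimator(X, y, lam, omega, b):
    # Random Fourier features: phi(x) = sqrt(2/m) * cos(x @ omega + b).
    # This is one standard choice; the paper's feature map may differ.
    n, m = X.shape[0], omega.shape[1]
    Phi = np.sqrt(2.0 / m) * np.cos(X @ omega + b)  # (n, m): the O(m|D_j|) memory term
    A = Phi.T @ Phi / n                              # m x m Gram: the O(m^2 |D_j|) time term
    bvec = Phi.T @ y / n
    # Solving the m x m regularized linear system: the O(m^3) time term.
    return np.linalg.solve(A + lam**2 * np.eye(m), bvec)
```

Since $m<|D_j|$, the $O(m^{2}|D_j|)$ Gram-matrix step dominates, which is the per-processor time complexity quoted above.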

3.2 DRANK-RF WITH COMMUNICATIONS (DRANK-RF-C)

Distributed Least Square Ranking with Random Features and communications (DRank-RF-C)
Initialize: $\bar g^{\,0}_{m,D,\lambda}=0$
For $l=1$ to $M$ do
  Local processor: compute the local gradient $G_{m,D_j,\lambda}(\bar g^{\,l-1}_{m,D,\lambda})$ and communicate it back to the global processor.
  Global processor: compute $\bar G_{m,D,\lambda}(\bar g^{\,l-1}_{m,D,\lambda})=\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}G_{m,D_j,\lambda}(\bar g^{\,l-1}_{m,D,\lambda})$ in Eq.(9) and communicate it to each local processor.
  Local processor: compute $\beta^{\,l-1}_j$ in Eq.(8) and communicate it back to the global processor.
  Global processor: compute $\bar g^{\,l}_{m,D,\lambda}$ in Eq.(7) and communicate it to each local processor.
End For
Output: $\bar g^{\,M}_{m,D,\lambda}$ and $f^{\,M}_{m,D,\lambda}=\big\langle\bar g^{\,M}_{m,D,\lambda},\varphi_m(\cdot)\big\rangle$
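Eqs.(7)-(9) are not reproduced in this excerpt, so the sketch below substitutes a generic distributed Newton-type iteration with the same communication pattern (local gradients, weighted global aggregation, locally preconditioned steps, global combination); `drank_rf_c_sketch` and its simplified gradient and preconditioner are illustrative assumptions, not the paper's exact updates.

```python
import numpy as np

def drank_rf_c_sketch(Phi_parts, y_parts, lam, M):
    # Phi_parts[j]: (n_j, m) random-feature matrix of partition D_j.
    # Hypothetical simplification of Eqs.(7)-(9): iterate toward the solution of
    #   (sum_j w_j Phi_j^T Phi_j / n_j + lam^2 I) g = sum_j w_j Phi_j^T y_j / n_j.
    m = Phi_parts[0].shape[1]
    sizes = np.array([len(y) for y in y_parts], dtype=float)
    w = sizes**2 / np.sum(sizes**2)          # weights |D_j|^2 / sum_k |D_k|^2
    A = [P.T @ P / len(P) for P in Phi_parts]                 # local Gram operators
    b = [P.T @ y / len(P) for P, y in zip(Phi_parts, y_parts)]
    g = np.zeros(m)
    for _ in range(M):
        # global gradient of the regularized empirical risk at g (Eq.(9) analogue)
        grad = sum(wj * (Aj @ g - bj) for wj, Aj, bj in zip(w, A, b)) + lam**2 * g
        # each local processor applies its own preconditioner (Eq.(8) analogue)
        steps = [np.linalg.solve(Aj + lam**2 * np.eye(m), grad) for Aj in A]
        # the global processor combines the local steps (Eq.(7) analogue)
        g = g - sum(wj * s for wj, s in zip(w, steps))
    return g
```

The design choice mirrored here is that each round costs only one vector exchange of size $m$ per processor, which matches the $O(m)$ communication complexity stated for DRank-RF.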

where $L_K^{r}$ is the $r$-th power of $L_K$. If the regularization parameter is chosen as $\lambda=O\Big(\big(\sum_{j=1}^{p}|D_j|\big/\sum_{k=1}^{p}|D_k|^{2}\big)^{\frac{1}{1+r}}\Big)$, then when $|D_1|=\cdots=|D_p|$, the convergence rate of the proposed DRank-RF is $O\big((|D|/p)^{-\frac{r}{1+r}}\big)$ and the number of random features is $m=\Omega\big((|D|/p)^{\frac{2r}{1+r}}\big)$.

In Chen et al. (2019), the authors construct divide-and-conquer pairwise ranking in kernel learning, called DRank. They study the statistical properties of DRank and establish its convergence analysis in expectation. The time complexity, space complexity, and communication complexity of DRank are $O(|D_j|^{3})$, $O(|D_j|^{2})$, and $O(|D_j|)$, respectively. A convergence rate in expectation only reflects the average behavior over multiple trials and fails to capture the learning performance of a single trial. Therefore, a probability version of the convergence rate of DRank for a single trial is subsequently established in Chen et al. (2021), where the statistical properties of DRank are carefully analyzed. In addition, Chen et al. (2021) propose a communication strategy for DRank, called DRank-C, to improve the learning performance, and provide its convergence rate in probability. The time complexity and space complexity of DRank-C are $O(|D_j|^{3}+M|D_j||D|)$ and $O(|D_j|^{2})$, respectively, where $M$ is the number of communications.

Figure 1: The testing error and training time on simulated datasets. (a) and (b) vary the number of random features $m$ with $p=2$; (c) and (d) vary the number of partitions $p$ with $m=200$ in DRank-RF.

6.2 SIMULATED EXPERIMENTS

Inspired by the numerical experiments in Chen et al. (2021) and Kriukova et al. (2016), we generate the synthetic data as follows. The entries of the inputs $\{x_i\}_{i=1}^{|D|}\in\mathbb{R}^{|D|\times q}$ are randomly chosen from $\{1,\cdots,100\}$, and the corresponding outputs $\{y_i\}_{i=1}^{|D|}$ are generated from the model
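The generating model itself is not reproduced in this excerpt; as a hedged illustration only, a generator of this shape could look as follows, with `f_star` standing in for the paper's (omitted) regression model:

```python
import numpy as np

def make_ranking_data(n, q, f_star, noise=0.1, seed=0):
    # Entries drawn uniformly from {1, ..., 100}, as described in the text.
    rng = np.random.default_rng(seed)
    X = rng.integers(1, 101, size=(n, q)).astype(float)
    # f_star is a placeholder for the paper's (omitted) regression model.
    y = f_star(X) + noise * rng.standard_normal(n)
    return X, y
```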

Figure 1(a) and Figure 1(b) show the testing error and the training time in seconds (on a logarithmic scale) as the number of random features $m$ varies with $p=2$. They indicate that DRank-RF has an obvious advantage over DRank and LSRank in time cost, even by an order of magnitude. In testing error, the gap between DRank-RF and DRank decreases as $m$ increases; eventually, there is no significant difference between them.

Figure 1(c) and Figure 1(d) show the testing error and training time as the number of partitions $p$ varies with $m=200$ for DRank-RF. Figure 1(c) shows that DRank-RF keeps the same accuracy level as DRank. As the number of partitions $p$ increases, the testing errors of the $p$-related algorithms increase, which is in line with the theoretical analysis. In Figure 1(d), as $p$ increases, the training time of the distributed algorithms (DRank-RF and DRank) decreases. Our algorithm DRank-RF has a significant advantage over LSRank and DRank in training time. In particular, the time cost of DRank with $p=30$ is higher than that of DRank-RF with $p=15$; in other words, DRank-RF requires less expensive hardware under the same scenario and time budget. Combining Figure 1(c) and Figure 1(d), we conclude that DRank-RF can use fewer hardware devices (local processors) to achieve a smaller testing error under the same training time, which is consistent with the theoretical analysis.

With $M=16$ communications and $p=2$ or $p=10$, the testing error of DRank-RF-C has converged and does not change as the number of communications increases. With $p=15$, the testing error of DRank-RF-C decreases as the number of communications increases, which demonstrates the effectiveness of the communication strategy on the real dataset and is consistent with our Theorem 2. The training time of the distributed algorithms decreases as $p$ increases, while the training time of the communication-based algorithms increases with the number of communications. The proposed DRank-RF and DRank-RF-C have significant advantages over LSRank, DRank, and DRank-C in training time. These results are consistent with the theoretical analysis. More experiments on different datasets are given in Appendix E.

An analogous argument without applying $S_m$ gives
\[
\begin{aligned}
\sqrt{\lambda}\,\big\|g_{m,D,\lambda}-\bar g^{\,l}_{m,D,\lambda}\big\|
\le{}&\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}
\big\|\big(\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m+\lambda^{2}I\big)^{1/2}\big\|^{2}\\
&\times\Big(\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\\
&\quad+\big\|\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D_j}W_{D_j}\Phi_{m,D_j}^{T}\big)\big(S_m^{*}S_m+\lambda^{2}I\big)^{-1/2}\big\|\Big)\\
&\times\Big(\big\|\bar f^{\,l-1}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,l-1}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big).
\end{aligned}\tag{37}
\]

Under review as a conference paper at ICLR 2023

Combining Eq.(36) and Eq.(37) and iterating from $l$ down to $0$, we have
\[
\big\|f_{m,D,\lambda}-\bar f^{\,l}_{m,D,\lambda}\big\|_K
\le\bigg(\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}J_m\bigg)^{l}
\Big(\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}\big\|\Big).
\]

C.2 PROOF OF THEOREM 2

Proof. Note that
\[
\begin{aligned}
\big\|\bar f^{\,l}_{m,D,\lambda}-f_\rho\big\|_K
&=\big\|\bar f^{\,l}_{m,D,\lambda}-f_{m,D,\lambda}+f_{m,D,\lambda}-f_{m,\lambda}+f_{m,\lambda}-f_\lambda+f_\lambda-f_\rho\big\|_K\\
&\le\big\|\bar f^{\,l}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\big\|f_{m,D,\lambda}-f_{m,\lambda}\big\|_K+\big\|f_{m,\lambda}-f_\lambda\big\|_K+\big\|f_\lambda-f_\rho\big\|_K.
\end{aligned}\tag{38}
\]
Substituting Lemma 1, Lemma 2, Lemma 3, Lemma 4, Eq.(27), and Eq.(28) into Lemma 6 and Lemma 7, we have
\[
\big\|\bar f^{\,0}_{m,D,\lambda}-f_{m,D,\lambda}\big\|_K+\sqrt{\lambda}\,\big\|\bar g^{\,0}_{m,D,\lambda}-g_{m,D,\lambda}\big\|
=O\bigg(\sum_{j=1}^{p}\frac{|D_j|^{2}}{\sum_{k=1}^{p}|D_k|^{2}}\Big(K_{m,D_j}+K_{m,D_j}Q_m\Big)\bigg),\tag{39}
\]
where $SS_\lambda=S_m^{*}S_m+\lambda^{2}I$,
\[
K_{m,D}=\big\|SS_\lambda^{-1/2}\big(S_m^{*}S_m-\Phi_{m,D}W_D\Phi_{m,D}^{T}\big)SS_\lambda^{-1/2}\big\|,
\]
and $Q_m=\Upsilon_{m,D_j,\lambda}+\|f_{m,\lambda}-f_\lambda\|_K+\|f_\lambda-f_\rho\|_K$. Combining the above inequality with Lemma 9, Eq.(38), Eq.(39), Proposition 4, and Lemma 8, one can obtain that if $m=\Omega\big(\big(\lambda^{-2r}\vee\lambda^{-1}\big)\log\frac{1}{\lambda\delta}\big)$, then with probability $1-\delta$, $\|\bar f^{\,l}_{m,D,\lambda}-f_\rho\|_K$ is bounded accordingly. Set $\lambda=O\big(|D|^{-\frac{1}{1+r}}\big)$, $|D_1|=\cdots=|D_p|=\frac{|D|}{p}$, and the number of random features $m=\Omega\big(|D|^{\frac{2r}{1+r}}\big)$; we have $\|\bar f^{\,M}_{m,D,\lambda}-f_\rho\|_K=O(p$

Comparison of the average testing error (standard deviation) and training time (in seconds) on the MovieLens dataset, with partitions $p=2,10,15$ and random features $m=100,150$; 2, 8, and 16 are the numbers of communications.

complexity, DRank-RF only requires $O(m^{2}|D_j|)$ time and $O(m|D_j|)$ memory, which is the least compared with the existing state-of-the-art DRank. Experiments verify that our proposed method keeps a testing error similar to the exact and state-of-the-art approximate methods and has a great advantage over them in training time.


where M = l. We complete this proof.

D PROPOSITIONS

Proposition 1 ((Liu et al., 2021)). Let $\zeta_1,\ldots,\zeta_n$, $n\ge1$, be i.i.d. random vectors on a separable Hilbert space $\mathcal H$ such that $H=\mathbb E[\zeta\otimes\zeta]$ is trace class, and for any $\lambda$ there exists … Then for any $\delta>0$, with probability at least $1-2\delta$, the following holds …

Proposition 2 ((Blanchard & Krämer, 2010)). For any self-adjoint and positive semidefinite operators $A$ and $B$, if there exists $\eta>0$ such that the following inequality holds …

Proposition 3 (Proposition 10 in Rudi & Rosasco (2017)). For any $\delta\in$ … where …

Proposition 4 (Eq.(9) in Chen et al. (2021), Chen (2012)). Assume that …

Here we prove that the gradient of the empirical risk of … $+\lambda\|g\|^{2}$ with respect to $g$ is $4G_{m,D_j,\lambda}(g)$ for all $(x_i,y_i),(x_k,y_k)\in D_j$.

Proof. We have … So, we have the results.
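The displayed inequalities of Proposition 1 were lost in extraction. For orientation only, the operator Bernstein bounds in Rudi & Rosasco (2017), of which Proposition 1 is a variant, have the following shape (a paraphrase in our notation with constants suppressed, not necessarily the exact statement used here):
\[
\Big\|\big(H+\lambda I\big)^{-1/2}\Big(H-\frac{1}{n}\sum_{i=1}^{n}\zeta_i\otimes\zeta_i\Big)\big(H+\lambda I\big)^{-1/2}\Big\|
\lesssim\frac{\big(1+F_\infty(\lambda)\big)\log\frac{1}{\delta}}{n}+\sqrt{\frac{F_\infty(\lambda)\log\frac{1}{\delta}}{n}},
\]
where $F_\infty(\lambda)$ is an almost-sure bound on $\|(H+\lambda I)^{-1/2}\zeta\|^{2}$. This matches the shape of Lemma 5 above, with $F_\infty(\lambda)$ playing the role of $N_\infty(\lambda)\le2\tau^{2}\lambda^{-1}$.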

E SUPPLEMENTARY EXPERIMENTS

We add experiments on the Jester Joke dataset. Jester Joke is publicly available at http://www.grouplens.org/taxonomy/term/14 and contains over 4.1 million continuous anonymous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. We group the reviewers according to the number of jokes they have reviewed and use the 40-60 jokes group. For a given test reviewer, 300 reference reviewers are chosen at random from the group and their ratings are used to form the input vectors. 70 percent of the test reviewer's joke ratings are used for training and the rest for testing. Missing review values in the input features are populated with the median review score of the given reference reviewer. Here, we also compare with the MPRank algorithm (Cortes et al., 2007). It is not a distributed algorithm related to this paper, but it is a representative algorithm in the field of least square ranking, so it is compared here. The empirical evaluations are given in Table 3, where the number of random features is $m=30$ or $50$ and the number of partitions is $p=2$ or $4$. In Table 3, we find that the experimental results are similar to those on the simulated data and the MovieLens dataset. The average testing errors of our methods, the exact method, MPRank, and DRank remain at the same level, which verifies the effectiveness of our methods on the real dataset. The testing error of DRank-RF-C decreases as the number of communications increases, which demonstrates the effectiveness of the communication strategy on the real dataset. The proposed DRank-RF and DRank-RF-C have significant advantages over LSRank, MPRank, DRank, and DRank-C in training time. These results are consistent with the theoretical analysis. We also run experiments under the same experimental setting as Chen et al. (2021) on the datasets mentioned in the main paper.
Table 4 shows the experimental results with partitions $p=60$, dimension $q=3$, and random features $m=150$ on the simulated dataset, with the same data-generating distribution as Chen et al. (2021).
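The Jester Joke preprocessing described above (reference-user input vectors, median imputation, 70/30 split) can be sketched as follows; the function name and the exact matrix layout are our assumptions, not the authors' code:

```python
import numpy as np

def build_jester_inputs(R, test_user, ref_users, train_frac=0.7, seed=0):
    # R: (num_users, num_jokes) rating matrix with np.nan for missing entries.
    # The input vector for each joke is the reference users' ratings of that joke;
    # missing values are filled with the given reference user's median rating.
    refs = R[ref_users]                              # (n_ref, num_jokes)
    med = np.nanmedian(refs, axis=1, keepdims=True)  # per-reference-user median
    X = np.where(np.isnan(refs), med, refs).T        # (num_jokes, n_ref)
    y = R[test_user]
    rated = ~np.isnan(y)                             # keep jokes the test user rated
    X, y = X[rated], y[rated]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))                   # 70/30 train/test split
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```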

