DISTRIBUTED LEAST SQUARE RANKING WITH RANDOM FEATURES

Abstract

In this paper, we study the statistical properties of pairwise ranking with distributed learning and random features (called DRank-RF) and establish its convergence analysis in probability. Theoretical analysis shows that DRank-RF remarkably reduces the computational requirements while preserving a satisfactory convergence rate. Extensive experiments verify the effectiveness of DRank-RF. Furthermore, to improve the learning performance of DRank-RF, we propose an effective communication strategy and demonstrate the power of communication via theoretical assessments and numerical experiments.

1. INTRODUCTION

Distributed learning has attracted much attention in the literature and has been widely used for kernel learning in large-scale scenarios (Zhang et al., 2013; Chang et al., 2017; Lin et al., 2020b). Distributed kernel learning has three main ingredients: processing the data subset on the local kernel machines to produce a local estimator; communicating exclusive information such as the data (Bellet et al., 2015), gradients (Zeng & Yin, 2018), and local estimators (Huang & Huo, 2019) between the local processors and the global processor; and synthesizing the local estimators and the communicated information on the global processor to produce a global estimator. Note that in divide-and-conquer learning, the second ingredient, communication, is not necessary. In terms of practical challenges and theoretical analysis, distributed learning has made significant breakthroughs in multi-penalty regularization (Guo et al., 2019), coefficient-based regularization (Pang & Sun, 2018), spectral algorithms (Mücke & Blanchard, 2018; Lin et al., 2020a), kernel ridge regression (Yin et al., 2020; 2021), and semi-supervised regression (Li et al., 2022). All the above are restricted to pointwise kernel learning; distributed learning for pairwise kernel learning, however, still has a long way to go. The existing distributed pairwise learning methods (Chen et al., 2019; 2021) have high computational requirements, which motivates us to explore theoretical foundations and efficient methods for pairwise ranking kernel methods under distributed learning. Random features methods (Rahimi & Recht, 2007; Carratino et al., 2018; Liu et al., 2021) have a long and distinguished history; they embed the non-linear feature space (i.e., the reproducing kernel Hilbert space associated with the kernel) into a low-dimensional Euclidean space while incurring an arbitrarily small additive distortion in the inner product values.
This enables one to overcome the high computational requirements of kernel learning, since one can now work in an explicit low-dimensional space with an explicit representation whose complexity depends only on the dimensionality of that space. Random features have made rapid progress in reducing the complexity of kernel ridge regression (Liu et al., 2021) and semi-supervised regression (Li et al., 2022). However, complexity reduction and learning-theory analysis for distributed pairwise ranking kernel learning remain unclear. In this paper, to reduce the computational requirements of pairwise ranking kernel learning, we combine distributed learning and random features, yielding distributed least square ranking with random features (DRank-RF), to deal with large-scale applications, and study its statistical properties in probability within the integral operator framework. To further improve the performance of DRank-RF, we consider communications among different local processors. The main contributions of this paper are as follows: 1) We construct a novel method, DRank-RF, to improve the existing state-of-the-art performance of distributed pairwise ranking kernel learning. This work is the first to apply random features to least square ranking and derive theoretical guarantees, a new exploration of random features in least square ranking. In theoretical analysis, we derive the convergence rate of the proposed method, which is sharper than that of the existing state-of-the-art distributed pairwise ranking kernel learning (see Theorem 1). In computational complexity, DRank-RF requires essentially $O(m^2|D_j|)$ time and $O(m|D_j|)$ memory, where $m$ is the number of random features, $m < |D_j|$, and $|D_j|$ is the number of data points in each local processor. The proposed method greatly reduces the computational requirements compared with the state-of-the-art works (see Table 1).
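To make the random-features embedding concrete, the following is a minimal sketch of the random Fourier feature construction of Rahimi & Recht (2007) for the Gaussian kernel; the function name `rff_map` and the choice of the Gaussian kernel are illustrative assumptions, not part of the paper's formal development. Inner products of the $m$-dimensional features approximate kernel values up to a small additive distortion.

```python
import numpy as np

def rff_map(X, m, gamma, rng):
    """Random Fourier feature map z(x) approximating the Gaussian kernel
    k(x, x') = exp(-gamma * ||x - x'||^2)  (Rahimi & Recht, 2007)."""
    q = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(q, m))  # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)                # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = rff_map(X, m=2000, gamma=0.5, rng=rng)

# Inner products in the explicit m-dimensional space approximate kernel values.
K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
K_approx = Z @ Z.T
print(np.max(np.abs(K_exact - K_approx)))  # small additive distortion
```

Since one now works with an explicit $|D| \times m$ feature matrix instead of a $|D| \times |D|$ kernel matrix, downstream costs depend on $m$ rather than on the sample size.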
Experimental results verify that the proposed method achieves testing error similar to the exact and state-of-the-art approximate kernel least square ranking while enjoying a great advantage in training time, which is consistent with our theoretical analysis. 2) We propose a communication strategy to further improve the performance of DRank-RF, called DRank-RF-C. Statistical analysis shows that, with the help of the communication strategy, DRank-RF-C obtains a faster convergence rate than DRank-RF, and the numerical results validate the power of the proposed communication strategy. The paper is organized as follows: In Section 2, we briefly introduce the least square ranking problem and distributed least square ranking. In Section 3, we introduce the proposed methods. Section 4 presents the theoretical analysis of the proposed DRank-RF and DRank-RF-C. In Section 5, we compare related works with the proposed methods. The remaining sections present the experiments and conclusions.

2. BACKGROUND

Let $\mathcal{Z} := \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{q+1}$ be a compact metric space, where $\mathcal{X} \subset \mathbb{R}^q$ and $\mathcal{Y} \subset [-b, b]$ for some positive constant $b$. The sample set $D := \{(x_i, y_i)\}_{i=1}^N$ of size $N = |D|$ is drawn independently from an intrinsic Borel probability measure $\rho$ on $\mathcal{Z}$, and $\rho(y \mid X = x)$ denotes the conditional distribution at a given input $x$. The hypothesis space is the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (Aronszajn, 1950). We denote the inner product in $\mathcal{H}_K$ by $\langle \cdot, \cdot \rangle_K$ and the corresponding norm by $\| \cdot \|_K$.

2.1. LEAST SQUARE RANKING (LSRANK)

Least square ranking (LSRank) is one of the most popular learning methods in the machine learning community (Chen, 2012; Zhao et al., 2017; Chen et al., 2019), which can be stated as
$$f_{D,\lambda} = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}_D(f) + \lambda \|f\|_K^2, \qquad \mathcal{E}_D(f) = \frac{1}{|D|^2} \sum_{i,k=1}^{|D|} \big( y_i - y_k - (f(x_i) - f(x_k)) \big)^2,$$
where the regularization parameter $\lambda > 0$. The main purpose of LSRank is to find a function $f : \mathcal{X} \to \mathbb{R}$ from empirical observations such that the ranking risk
$$\mathcal{E}(f) = \int_{\mathcal{Z}} \int_{\mathcal{Z}} \big( y - y' - (f(x) - f(x')) \big)^2 \, d\rho(x, y) \, d\rho(x', y') \qquad (1)$$
is as small as possible, where $x, x' \in \mathcal{X}$. The optimal predictor (Chen, 2012; Chen et al., 2013; Kriukova et al., 2016) under Eq. (1) is the regression function $f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y \mid X = x)$, $x \in \mathcal{X}$.

Complexity Analysis. LSRank requires $O(|D|^3)$ time and $O(|D|^2)$ space, which is prohibitive in large-scale settings.
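Although $\mathcal{E}_D(f)$ is a double sum over $|D|^2$ pairs, it only involves the residuals $d_i = y_i - f(x_i)$, so it collapses to a linear-time expression via $\sum_{i,k}(d_i - d_k)^2 = 2|D|\sum_i d_i^2 - 2(\sum_i d_i)^2$. The following small check (with arbitrary synthetic residuals, not any specific estimator from the paper) confirms the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
y = rng.uniform(-1, 1, size=N)   # labels y_i
fx = rng.uniform(-1, 1, size=N)  # predictions f(x_i)
d = y - fx                       # residuals d_i = y_i - f(x_i)

# Direct O(N^2) evaluation of the empirical ranking risk E_D(f).
risk_pairwise = np.mean((d[:, None] - d[None, :]) ** 2)

# Equivalent O(N) form: (1/N^2) * (2N * sum(d_i^2) - 2 * (sum d_i)^2).
risk_linear = 2.0 * np.mean(d ** 2) - 2.0 * np.mean(d) ** 2

print(abs(risk_pairwise - risk_linear))  # agrees up to floating point
```

This identity is why the pairwise structure of the loss need not force quadratic cost in the sample size at evaluation time; the cubic cost quoted above comes from solving the regularized kernel system, not from the loss itself.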

2.2. DISTRIBUTED LEAST SQUARE RANKING (DRANK)

Let the dataset $D = \cup_{j=1}^p D_j$, where each subset $D_j := \{(x_i^j, y_i^j)\}_{i=1}^{|D_j|}$ is stored on the $j$-th local processor for $1 \le j \le p$. The DRank estimator is defined by
$$f^0_{D,\lambda} = \sum_{j=1}^p \frac{|D_j|^2}{\sum_{k=1}^p |D_k|^2} \, f_{D_j,\lambda},$$
where $f_{D_j,\lambda}$ is the local LSRank estimator trained on $D_j$.
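The global synthesis step above is a weighted average of the local estimators with weights $|D_j|^2 / \sum_k |D_k|^2$. The sketch below, assuming the local estimators are available as callables (the helper name `drank_combine` and the toy linear estimators are illustrative, not from the paper), shows this combination step in isolation.

```python
import numpy as np

def drank_combine(local_estimators, subset_sizes):
    """Synthesize local LSRank estimators f_{D_j,lambda} into the global
    DRank estimator using the weights |D_j|^2 / sum_k |D_k|^2."""
    sizes = np.asarray(subset_sizes, dtype=float)
    weights = sizes ** 2 / np.sum(sizes ** 2)

    def f_global(x):
        return sum(w * f(x) for w, f in zip(weights, local_estimators))

    return f_global

# Toy check with three hypothetical local estimators (simple linear maps).
local_fs = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]
f0 = drank_combine(local_fs, subset_sizes=[100, 100, 200])
# weights = [1/6, 1/6, 4/6], so f0(1.0) = 1/6 + 2/6 + 12/6 = 2.5
print(f0(1.0))
```

Note that the weights are proportional to $|D_j|^2$ rather than $|D_j|$, matching the quadratic dependence of the pairwise empirical risk on the subset size; with equal-sized subsets this reduces to a plain average.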

