EFFECTIVE DISTRIBUTED LEARNING WITH RANDOM FEATURES: IMPROVED BOUNDS AND ALGORITHMS

Abstract

In this paper, we study the statistical properties of distributed kernel ridge regression together with random features (DKRR-RF), and obtain optimal generalization bounds under the basic setting, substantially relaxing the restriction on the number of local machines in the existing state-of-the-art bounds. Specifically, we first show that the simple combination of the divide-and-conquer technique and random features achieves the same statistical accuracy as exact KRR in expectation, requiring only O(|D|) memory and O(|D|^1.5) time. Then, beyond generalization bounds in expectation, which reflect the average behavior over multiple trials, we derive generalization bounds in probability to capture the learning performance of a single trial. Finally, we propose an effective communication strategy to further improve the performance of DKRR-RF, and validate the theoretical bounds via numerical experiments.

1. INTRODUCTION

Kernel ridge regression (KRR) is one of the most popular nonparametric learning methods (Vapnik, 2000). Despite its excellent theoretical guarantees, KRR does not scale well to large data sets because of its high time and memory complexities (Liu et al., 2013; 2014; 2017; 2018; 2020b; Liu & Liao, 2015; Li et al., 2018; 2019c). Distributed learning (Zhang et al., 2013; Hsieh et al., 2014; Chang et al., 2017b; Li et al., 2019b; Lin et al., 2020), random features (Rahimi & Recht, 2007; Sutherland & Schneider, 2015; Rudi & Rosasco, 2017; Rudi et al., 2018; Liu et al., 2020a; Avron et al., 2017a; Yu et al., 2016; Jacot et al., 2020), and Nyström methods (Drineas & Mahoney, 2005; Ding & Liao, 2012; Yang et al., 2012; Camoriano et al., 2016; Si et al., 2016; Musco & Musco, 2017; Kriukova et al., 2017) are the most widely used techniques to address these scalability issues. Recent statistical learning work on KRR combined with such large scale approaches demonstrates that they not only yield substantial computational gains but also preserve optimal theoretical properties, e.g., KRR with divide-and-conquer (Zhang et al., 2013; 2015; Chang et al., 2017b; a; Guo et al., 2017; Lin et al., 2017; Li et al., 2019b; d; Lin et al., 2020), with random features (Rudi & Rosasco, 2017; Li et al., 2019e; Carratino et al., 2018; Yang et al., 2012), and with Nyström methods (Bach, 2013; Alaoui & Mahoney, 2015; Rudi et al., 2015; 2017; Ding et al., 2020). Combining distributed learning with other large scale approaches is an intuitive yet effective strategy to further improve effectiveness, e.g., distributed learning with gradient descent algorithms (Lin & Zhou, 2018; Richards et al., 2020), with multi-pass SGD (Lin & Rosasco, 2017; Lin & Cevher, 2018; 2020), with random features (Li et al., 2019b), and with Nyström methods (Yin et al., 2020).
The optimal generalization performance of these combined approaches has been studied; however, the main theoretical problem is the strict restriction on the number of local machines. For example, in (Lin & Zhou, 2018; Li et al., 2019b; Yin et al., 2020), to guarantee the optimal generalization performance in the basic setting, the number of local machines is restricted to a constant, which is difficult to satisfy in real applications. In this paper, we aim to enlarge the number of local machines by considering communications among different local machines. This paper makes the following three main contributions. Firstly, we improve the existing state-of-the-art results for the divide-and-conquer technique together with random features. We prove that the optimal generalization performance can be guaranteed even when the number of partitions reaches Ω(√|D|), whereas it is limited to a constant Ω(1) by the existing bounds in the basic setting; here |D| is the size of the data set. Secondly, to essentially reflect the generalization performance, beyond minimax optimal rates in expectation we derive optimal learning rates in probability, which capture the learning performance of a single trial. Finally, we develop a communication strategy to further improve the performance of our proposed method, and validate the effectiveness of the proposed communications via both theoretical assessments and numerical experiments.
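To make the divide-and-conquer-plus-random-features scheme concrete, the following minimal sketch (an illustrative NumPy implementation under our own assumptions, not the authors' code; the Gaussian random Fourier feature map, the uniform partition, and all parameter names are ours) trains a ridge estimator in random-feature space on each local partition and averages the local solutions:

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/m) * cos(W x + b)."""
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

def dkrr_rf_fit(X, y, n_machines, n_features, lam, sigma=1.0, seed=0):
    """Divide-and-conquer KRR with random features: solve a ridge
    problem in feature space on each partition, then average the
    local coefficient vectors."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Shared random features across machines (Gaussian kernel approximation).
    W = rng.normal(0.0, 1.0 / sigma, size=(n_features, d))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    parts = np.array_split(rng.permutation(len(X)), n_machines)
    coefs = []
    for idx in parts:
        Z = rff_map(X[idx], W, b)          # local feature matrix
        n_j = len(idx)
        A = Z.T @ Z / n_j + lam * np.eye(n_features)
        coefs.append(np.linalg.solve(A, Z.T @ y[idx] / n_j))
    w_bar = np.mean(coefs, axis=0)         # simple averaging of local solutions
    return lambda X_new: rff_map(X_new, W, b) @ w_bar
```

Each machine only touches its own partition, so memory and time scale with the local sample size; the bounds above quantify how many partitions can be used before the averaged estimator loses statistical accuracy.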

Related Work

The most related work includes the statistical analysis of distributed learning and of random features. Distributed learning. Optimal learning rates in expectation for divide-and-conquer KRR were established in the seminal works (Zhang et al., 2013; 2015). An improved bound was derived in (Lin et al., 2017) based on a novel integral operator tool. Building on the proof techniques of (Zhang et al., 2013; 2015; Lin et al., 2017), optimal learning rates were established for distributed spectral algorithms (Guo et al., 2017; Blanchard & Mücke, 2018; Lin & Cevher, 2020), distributed gradient descent algorithms (Lin & Zhou, 2018; Richards et al., 2020), distributed semi-supervised learning (Chang et al., 2017b), distributed local average regression (Chang et al., 2017a; Lin & Cevher, 2020), localized SVMs (Meister & Steinwart, 2016), etc. Other communication strategies for distributed learning have also been proposed; see, e.g., (Fan et al., 2019; Li et al., 2019a; Lin & Cevher, 2020; Li et al., 2020) and references therein. The theoretical analyses mentioned above show that divide-and-conquer learning can achieve the same statistical accuracy as exact KRR; however, there is a strict restriction on the number of local machines. Optimal learning rates under a less strict condition on the number of local machines were established for distributed stochastic gradient methods and spectral algorithms in (Lin & Cevher, 2020). Lin et al. (2020) considered communications among different local machines to enlarge the number of local machines. However, their communication strategy is based on an operator representation that requires communicating the input data among the local machines, making it difficult to protect the data privacy of each local machine.
Furthermore, at each iteration, the communication complexity of each local machine is O(|D|d), where d denotes the input dimension, which is infeasible in practice for large scale data sets. Random features. The first generalization bound for random features was given in (Rahimi & Recht, 2008), which shows that O(|D|) random features suffice for an O(1/√|D|) learning rate. Subsequent works further studied the theoretical performance of random features (Cortes et al., 2010; Yang et al., 2012; Sutherland & Schneider, 2015; Sriperumbudur & Szabó, 2015; Bach, 2017; Avron et al., 2017b). By applying the standard integral operator framework (Smale & Zhou, 2007; Caponnetto & Vito, 2007), optimal generalization bounds for KRR with random features were established in (Rudi & Rosasco, 2017), requiring only O(√|D| log |D|) random features. To further reduce the number of random features, an improved approach based on a novel leverage score sampling strategy was proposed in (Rudi et al., 2018). Sun et al. (2018) extended the result of (Rudi & Rosasco, 2017) to SVMs. Li et al. (2019e) further devised a simple framework for the unified analysis of random Fourier features, applicable to KRR as well as to SVM and logistic regression. To further improve effectiveness, Li et al. (2019b) recently considered the simple combination of divide-and-conquer and random features. However, to guarantee the optimal generalization performance, the number of local machines must be restricted to a constant, which effectively degenerates the method into a single-machine random features-based KRR.
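As background for the random features literature discussed above, the following self-contained sketch (our own illustration; m, the kernel bandwidth, and the variable names are chosen for exposition) shows the random Fourier feature construction of Rahimi & Recht: by Bochner's theorem, the inner product of the feature maps is a Monte Carlo estimate of the Gaussian kernel, with error shrinking as the number of features m grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 20000  # input dimension, number of random features

# Bochner's theorem: for the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2),
# drawing W ~ N(0, I) and b ~ Uniform[0, 2*pi] makes z(x)^T z(x') an
# unbiased Monte Carlo estimate of k(x, x').
W = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2 * np.pi, size=m)
z = lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

x, xp = rng.normal(size=d), rng.normal(size=d)
approx = z(x) @ z(xp)                          # random-feature estimate
exact = np.exp(-np.sum((x - xp) ** 2) / 2.0)   # exact Gaussian kernel
```

The bounds discussed above sharpen this picture: O(|D|) features give the basic rate, while under the integral operator analysis O(√|D| log |D|) features already preserve the optimal KRR rate.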

2. BACKGROUND

In the standard framework of supervised learning, there is a probability space X × Y with an unknown distribution ρ, where X is the input space and Y is the output space. The sample set D = {(x_i, y_i)}_{i=1}^n of size n is drawn i.i.d. from X × Y according to ρ. Let K : X × X → R be a Mercer kernel, let H_K be its reproducing kernel Hilbert space (RKHS) (Steinwart & Christmann, 2008; Vapnik, 2000), and assume that K(x, x′) ≤ κ for all x, x′ ∈ X. Throughout, we denote the inner product in H_K by ⟨·, ·⟩_K and the corresponding norm by ∥·∥_K.
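For later reference, and using only the standard formulation of KRR (the regularization parameter name λ is the usual convention, not fixed by the text above), the exact KRR estimator over H_K is

```latex
\hat{f}_{\lambda}
  = \operatorname*{arg\,min}_{f \in \mathcal{H}_K}
    \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr)^2
    + \lambda \, \| f \|_K^2 ,
\qquad\text{and, by the representer theorem,}\qquad
\hat{f}_{\lambda}(x) = \sum_{i=1}^{n} \alpha_i \, K(x, x_i),
\quad
\boldsymbol{\alpha} = (\mathbf{K} + n \lambda I)^{-1} \mathbf{y},
```

where 𝐊 = (K(x_i, x_j))_{i,j=1}^n is the n × n kernel matrix. Solving this linear system costs O(n³) time and O(n²) memory, which is exactly the scalability bottleneck that the distributed and random-feature approximations studied in this paper address.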

