EFFECTIVE DISTRIBUTED LEARNING WITH RANDOM FEATURES: IMPROVED BOUNDS AND ALGORITHMS

Abstract

In this paper, we study the statistical properties of distributed kernel ridge regression with random features (DKRR-RF) and obtain optimal generalization bounds under the basic setting, substantially relaxing the restriction on the number of local machines imposed by existing state-of-the-art bounds. Specifically, we first show that the simple combination of the divide-and-conquer technique and random features achieves the same statistical accuracy as exact KRR in expectation, while requiring only O(|D|) memory and O(|D|^{1.5}) time. Then, beyond generalization bounds in expectation, which capture the average behavior over multiple trials, we derive generalization bounds in probability that characterize the learning performance of a single trial. Finally, we propose an effective communication strategy to further improve the performance of DKRR-RF, and validate the theoretical bounds via numerical experiments.
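To make the divide-and-conquer scheme concrete, the following is a minimal sketch of DKRR-RF under common assumptions (random Fourier features for a Gaussian kernel, shared feature map across machines, and plain averaging of the local estimators); the function and parameter names are illustrative, not from the paper:

```python
import numpy as np

def rff_features(X, W, b):
    # Random Fourier features approximating a Gaussian kernel
    # (Rahimi & Recht, 2007): z(x) = sqrt(2/M) * cos(W^T x + b).
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def dkrr_rf(X, y, num_machines, num_features, lam, sigma=1.0, seed=0):
    """Divide-and-conquer KRR with random features: solve a ridge
    regression in feature space on each local partition, then average
    the local estimators (a sketch; hyperparameter names are ours)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # All machines share the same random feature map (an assumption of
    # this sketch, so that local estimators live in the same space).
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    partitions = np.array_split(rng.permutation(len(X)), num_machines)
    local_thetas = []
    for idx in partitions:
        Z = rff_features(X[idx], W, b)  # n_j x M local feature matrix
        # Local regularized normal equations: (Z^T Z + lam * n_j * I) theta = Z^T y
        A = Z.T @ Z + lam * len(idx) * np.eye(num_features)
        local_thetas.append(np.linalg.solve(A, Z.T @ y[idx]))
    theta = np.mean(local_thetas, axis=0)  # average the local solutions
    return lambda X_new: rff_features(X_new, W, b) @ theta

# Usage on a toy 1-D regression problem.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=2000)
f = dkrr_rf(X, y, num_machines=4, num_features=100, lam=1e-4)
```

Each machine only forms an M x M system, which is the source of the memory and time savings over exact KRR on the full kernel matrix.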

1. INTRODUCTION

Kernel ridge regression (KRR) is one of the most popular nonparametric learning methods (Vapnik, 2000). Despite its excellent theoretical guarantees, KRR does not scale well to large datasets because of its high time and memory complexities (Liu et al., 2013; 2014; 2017; 2018; 2020b; Liu & Liao, 2015; Li et al., 2018; 2019c). Distributed learning (Zhang et al., 2013; Hsieh et al., 2014; Chang et al., 2017b; Li et al., 2019b; Lin et al., 2020), random features (Rahimi & Recht, 2007; Sutherland & Schneider, 2015; Rudi & Rosasco, 2017; Rudi et al., 2018; Liu et al., 2020a; Avron et al., 2017a; Yu et al., 2016; Jacot et al., 2020), and Nyström methods (Drineas & Mahoney, 2005; Ding & Liao, 2012; Yang et al., 2012; Camoriano et al., 2016; Si et al., 2016; Musco & Musco, 2017; Kriukova et al., 2017) are the most widely used techniques for addressing these scalability issues. Recent statistical learning studies of KRR combined with such large-scale approaches demonstrate that they not only yield substantial computational gains but also preserve optimal theoretical properties, e.g., KRR with divide-and-conquer (Zhang et al., 2013; 2015; Chang et al., 2017b;a; Guo et al., 2017; Lin et al., 2017; Li et al., 2019b;d; Lin et al., 2020), with random features (Rudi & Rosasco, 2017; Li et al., 2019e; Carratino et al., 2018; Yang et al., 2012), and with Nyström methods (Bach, 2013; Alaoui & Mahoney, 2015; Rudi et al., 2015; 2017; Ding et al., 2020). Combining distributed learning with other large-scale approaches is an intuitive yet effective strategy for further improving efficiency, e.g., distributed learning with gradient descent algorithms (Lin & Zhou, 2018; Richards et al., 2020), with multi-pass SGD (Lin & Rosasco, 2017; Lin & Cevher, 2018; 2020), with random features (Li et al., 2019b), and with Nyström methods (Yin et al., 2020).
Although the optimal generalization performance of these combined approaches has been studied, the main theoretical problem is that they place a strict restriction on the number of local machines. For example, in (Lin & Zhou, 2018; Li et al., 2019b; Yin et al., 2020), to guarantee optimal generalization performance in the basic setting, the number of local machines must be bounded by a constant, a requirement that is difficult to satisfy in real applications.

