SUPERVISED RANDOM FEATURE REGRESSION VIA PROJECTION PURSUIT

Abstract

Random feature methods and neural network models are two popular nonparametric modeling approaches, often regarded as representatives of shallow learning and deep learning, respectively. In practice, random feature methods lack the capacity for feature learning, while neural network methods are computationally heavy. This paper proposes a flexible yet computationally efficient method for general nonparametric problems. Specifically, the proposed method is a feed-forward two-layer nonparametric estimator: the first layer learns a series of univariate basis functions for each projection variable and then searches for an optimal linear combination within each group of these learned functions. Based on all the features derived in the first layer, the second layer learns a single-index function with an unknown activation function. Our nonparametric estimator combines the advantages of random features and neural networks, and can be seen as an intermediate bridge between them.

1. INTRODUCTION

Kernel methods are among the most powerful approaches to nonlinear statistical learning problems, owing to their well-developed statistical theory and flexible modeling framework. By using randomized algorithms to approximate kernel matrices, random feature (RF) models have attracted increasing attention: they significantly reduce the extensive hand tuning required of the user during training, yet achieve similar or better prediction accuracy than neural network models when the data size is limited (Du et al., 2022; Zhen et al., 2020). The RF model can be traced back to the work of Rahimi & Recht (2007) and was further developed by Li et al. (2019b). To be specific, for observations $(y_i, x_i)_{i=1}^n$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, RF models predict $y$ by a linear combination over a set of prespecified nonlinear functions on a relatively low-dimensional randomized feature space. That is,
$$ y_i = f(x_i) + \varepsilon_i := \sum_{j=1}^{N} \alpha_j \, \sigma\big(\langle x_i, \theta_j \rangle / \sqrt{p}\big) + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (1) $$
where $N \to \infty$, $\langle \alpha, x \rangle = \sum_{j=1}^{p} \alpha_j x_j$, and $\sigma(\cdot)$ is a prespecified function, such as the ReLU or the sigmoid function. Here, $\theta_j$ is drawn randomly from a prespecified distribution, say, uniformly on a sphere, i.e., $\theta_j \sim \mathrm{Unif}(S^{p-1}(\sqrt{p}))$, where $S^{d-1}(r)$ denotes the sphere of radius $r$ in $d$ dimensions and $r = \sqrt{d}$. Model (1) involves only the unknown parameters $\alpha_j$, $j = 1, \dots, N$. The coefficients $\alpha$ in the RF model can be estimated using the following ridge regression:
$$ \hat{\alpha}(\lambda) = \arg\min_{\alpha \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{N} \alpha_j \, \sigma(\langle \theta_j, x_i \rangle) \Big)^2 + \frac{N\lambda}{p} \, \|\alpha\|_2^2 \right\}. \qquad (2) $$
Let $\mathcal{F}_{\mathrm{RF}}(\Theta) = \big\{ f(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(\langle \theta_i, x \rangle) : \alpha_i \in \mathbb{R} \ \forall i \le N \big\}$, where $\Theta \in \mathbb{R}^{N \times p}$ is the matrix whose $i$-th row is the vector $\theta_i$. When the number of random features, $N$, goes to infinity, under a suitable bound on the $\ell_2$ norm of the coefficients, $\mathcal{F}_{\mathrm{RF}}$ reduces to a certain reproducing kernel Hilbert space (RKHS) (Liu et al., 2020).
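The RF ridge estimator of equation (2) has a closed form, since the objective is quadratic in $\alpha$: setting the gradient to zero gives $\hat{\alpha} = (Z^\top Z/n + (N\lambda/p) I_N)^{-1} Z^\top y / n$, where $Z$ is the $n \times N$ matrix of random features. A minimal numpy sketch follows; the synthetic data, the sample sizes, and the choice $\lambda = 0.1$ are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N, lam = 200, 10, 500, 0.1  # sample size, dimension, feature count, ridge penalty (all hypothetical)

# Synthetic data; the target function sin(x_1) is purely illustrative.
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Draw theta_j ~ Unif(S^{p-1}(sqrt(p))): normalize Gaussian rows to norm sqrt(p).
Theta = rng.standard_normal((N, p))
Theta *= np.sqrt(p) / np.linalg.norm(Theta, axis=1, keepdims=True)

# Random feature map with ReLU activation, scaled by 1/sqrt(p) as in model (1).
Z = np.maximum(0.0, X @ Theta.T / np.sqrt(p))  # shape (n, N)

# Closed-form ridge solution of eq. (2):
#   alpha_hat = (Z'Z/n + (N*lam/p) I)^{-1} Z'y / n
A = Z.T @ Z / n + (N * lam / p) * np.eye(N)
alpha_hat = np.linalg.solve(A, Z.T @ y / n)

# Fitted values on the training inputs.
y_hat = Z @ alpha_hat
```

Only $\alpha$ is learned here; the directions $\theta_j$ stay fixed after sampling, which is precisely the "no feature learning" limitation the abstract points to.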
Specifically, the ridge regression over this function class converges to kernel ridge regression (KRR) with respect to the kernel
$$ H_p^{\mathrm{RF}}(x_1, x_2) := h_p^{\mathrm{RF}}\!\left( \frac{\langle x_1, x_2 \rangle}{p} \right) = \mathbb{E}\big[ \sigma(\langle \theta, x_1 \rangle) \, \sigma(\langle \theta, x_2 \rangle) \big]. $$
Here, the expectation is taken with respect to $\theta$. Clearly, different distributions generating $\theta_j$ and different activation functions induce different RKHSs. For example, when $\theta$ follows a standard multivariate normal distribution and the activation function is the ReLU, $\sigma(x) = \max(0, x)$, the kernel corresponds to the first-order arc-cosine kernel. Another example
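The ReLU example above can be checked numerically: for $\theta \sim N(0, I_p)$, the expectation $\mathbb{E}[\sigma(\langle \theta, x_1\rangle)\sigma(\langle \theta, x_2\rangle)]$ has the closed form $\frac{1}{2\pi}\|x_1\|\|x_2\|\big(\sin t + (\pi - t)\cos t\big)$, where $t$ is the angle between $x_1$ and $x_2$; this is the first-order arc-cosine kernel of Cho & Saul up to a factor of 2 in their normalization. A Monte Carlo sanity check (dimension and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, M = 5, 200_000  # ambient dimension and Monte Carlo sample size (both arbitrary)

x1 = rng.standard_normal(p)
x2 = rng.standard_normal(p)

# Monte Carlo estimate of E[relu(<theta, x1>) relu(<theta, x2>)], theta ~ N(0, I_p).
Theta = rng.standard_normal((M, p))
mc = np.mean(np.maximum(0.0, Theta @ x1) * np.maximum(0.0, Theta @ x2))

# Closed form: (1/(2*pi)) * ||x1|| * ||x2|| * (sin t + (pi - t) cos t),
# with t the angle between x1 and x2.
n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
t = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
closed = n1 * n2 * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)
```

The two quantities agree up to Monte Carlo error, illustrating how the sampling distribution of $\theta$ and the activation $\sigma$ jointly determine the limiting kernel.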

