SUPERVISED RANDOM FEATURE REGRESSION VIA PROJECTION PURSUIT

Abstract

Random feature methods and neural network models are two popular nonparametric modeling approaches, often regarded as representatives of shallow learning and deep learning, respectively. In practice, random feature methods lack the capacity for feature learning, while neural network methods are computationally heavy. This paper proposes a flexible yet computationally efficient method for general nonparametric problems. Specifically, the proposed method is a feed-forward two-layer nonparametric estimator: the first layer learns a series of univariate basis functions for each projection variable and then searches for an optimal linear combination within each group of learned functions; based on the features derived in the first layer, the second layer learns a single-index function with an unknown activation function. Our nonparametric estimator takes advantage of both random features and neural networks, and can be viewed as an intermediate bridge between them.

1. INTRODUCTION

Kernel methods are among the most powerful tools for nonlinear statistical learning problems, owing to their excellent statistical theory and flexible modeling framework. Using randomized algorithms for approximating kernel matrices, random feature (RF) models have attracted increasing attention because they significantly reduce the extensive hand tuning required from the user during training, yet obtain similar or better prediction accuracy with limited data compared to neural network models (Du et al., 2022; Zhen et al., 2020). The RF model can be traced back to the work of Rahimi & Recht (2007) and was further developed by Li et al. (2019b). To be specific, for observations $(y_i, x_i)_{i=1}^n$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, RF models consider a linear combination over a set of prespecified nonlinear functions on a relatively low-dimensional randomized feature space to predict $y$. That is,
$$y_i = f(x_i) + \varepsilon_i := \sum_{j=1}^{N} \alpha_j\, \sigma\big(\langle x_i, \theta_j\rangle/\sqrt{p}\big) + \varepsilon_i, \quad i = 1, \cdots, n, \qquad (1)$$
where $N \to \infty$, $\langle \alpha, x\rangle = \sum_{j=1}^p \alpha_j x_j$, and $\sigma(\cdot)$ is a pre-specified function such as the ReLU or the sigmoid function. Here, $\theta_j$ is chosen randomly from a prespecified distribution, say, the uniform distribution on a sphere, i.e., $\theta_j \sim \mathrm{Unif}(S^{p-1}(\sqrt{p}))$, where $S^{d-1}(r)$ denotes the sphere of radius $r$ in $d$ dimensions and $r = \sqrt{d}$. Model (1) involves the unknown parameters $\alpha_j$, $j = 1, \cdots, N$, only. The coefficients $\alpha$ in the RF model can be estimated using the following ridge regression:
$$\hat{\alpha}(\lambda) = \arg\min_{\alpha \in \mathbb{R}^N} \left\{ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{N} \alpha_j\, \sigma(\langle \theta_j, x_i\rangle)\Big)^2 + \frac{N\lambda}{p}\,\|\alpha\|_2^2 \right\}. \qquad (2)$$
Let $\mathcal{F}_{\mathrm{RF}}(\Theta) = \big\{ f(x) = \sum_{i=1}^{N} \alpha_i \sigma(\langle \theta_i, x\rangle) : \alpha_i \in \mathbb{R}\ \forall i \le N \big\}$, where $\Theta \in \mathbb{R}^{N\times p}$ is a matrix whose $i$-th row is the vector $\theta_i$. When the number of random features, $N$, goes to infinity, under a suitable bound on the $\ell_2$ norm of the coefficients, $\mathcal{F}_{\mathrm{RF}}$ reduces to a certain reproducing kernel Hilbert space (RKHS) (Liu et al., 2020).
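The ridge estimator in (2) can be computed in closed form. The following is a minimal numpy sketch of an RF fit, assuming a ReLU activation and synthetic data; the sample sizes, penalty value, and target function are illustrative choices, not part of the method above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 200, 10, 500          # samples, input dimension, number of random features
lam = 0.1                        # ridge penalty lambda

# Synthetic data (hypothetical target function)
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Random weights theta_j ~ Unif(S^{p-1}(sqrt(p))): normalize rows, scale to radius sqrt(p)
Theta = rng.standard_normal((N, p))
Theta *= np.sqrt(p) / np.linalg.norm(Theta, axis=1, keepdims=True)

# Random feature map from (1): sigma(<x_i, theta_j> / sqrt(p)) with sigma = ReLU
Z = np.maximum(0.0, X @ Theta.T / np.sqrt(p))   # shape (n, N)

# Ridge regression (2): minimize (1/n)||y - Z alpha||^2 + (N lam / p)||alpha||^2,
# solved via its normal equations
alpha = np.linalg.solve(Z.T @ Z / n + (N * lam / p) * np.eye(N), Z.T @ y / n)

f_hat = Z @ alpha                # fitted values on the training inputs
```

Only the coefficient vector `alpha` is learned; the weights `Theta` stay fixed at their random draws, which is what makes the fit a linear problem.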
Specifically, the ridge regression over this function class converges to kernel ridge regression (KRR) with respect to the kernel
$$H^{\mathrm{RF}}_p(x_1, x_2) := h^{\mathrm{RF}}_p\Big(\frac{\langle x_1, x_2\rangle}{p}\Big) = \mathbb{E}\big[\sigma(\langle \theta, x_1\rangle)\,\sigma(\langle \theta, x_2\rangle)\big],$$
where the expectation is with respect to $\theta$. Clearly, distinct distributions generating $\theta_j$ and different activation functions induce different RKHSs. For example, when $\theta$ follows a standard multivariate normal distribution and the activation function is the ReLU $\sigma(x) = \max(0, x)$, the kernel corresponds to the first-order arc-cosine kernel. As another example, if the activation function is $\sigma(x) = [\cos(x), \sin(x)]^\top$, the kernel corresponds to the Gaussian kernel (Rahimi & Recht, 2007; Liu et al., 2020). According to Bochner's theorem, the spectral distribution $\mu_k$ of a stationary kernel $k$ is the finite measure induced by a Fourier transform, i.e., $k(x - x') = \int \exp\big(i\theta^\top (x - x')\big)\, \mu_k(d\theta)$. However, it is known that the choice of distribution and activation function may suffer from misspecification of the function space, leading to inefficient or even wrong estimation (Sinha & Duchi, 2016; Derakhshani et al., 2021). Note that a general kernel $k(x, x')$ depends on the distance $\|x - x'\|$, which converges to a constant quickly as the dimension increases (Liu et al., 2020). Such locality, in terms of stationarity and monotonicity, means these kernels cannot reveal the more important information in the feature space, which largely restricts the performance of kernel methods in complex tasks (Xue et al., 2019). RF models overcome this issue by introducing the random coefficients $\theta$ and their associated spectral distribution. Specifically, the RF model learns a kernel function based on the fixed activation function $\sigma(\cdot)$ indexed by (approximately) infinitely many random parameters drawn from a prespecified distribution.
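The cosine/sine construction above can be checked numerically: with $\theta \sim N(0, I_p)$, the inner product of the random feature maps approximates the Gaussian kernel $\exp(-\|x_1 - x_2\|^2/2)$. A minimal sketch, with arbitrary choices of dimension and feature count:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 5, 20000                  # input dimension, number of random features

# For k(x, x') = exp(-||x - x'||^2 / 2), Bochner's theorem gives the
# spectral distribution theta ~ N(0, I_p).
Theta = rng.standard_normal((N, p))

def feature_map(x):
    # sigma(u) = [cos(u), sin(u)], stacked over all N random directions
    u = Theta @ x
    return np.concatenate([np.cos(u), np.sin(u)]) / np.sqrt(N)

x1, x2 = rng.standard_normal(p), rng.standard_normal(p)
approx = feature_map(x1) @ feature_map(x2)          # Monte Carlo kernel estimate
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2)   # Gaussian kernel value
print(abs(approx - exact))                           # small for large N
```

The identity behind the sketch is $\cos(u)\cos(v) + \sin(u)\sin(v) = \cos(u - v)$, so the dot product averages $\cos(\theta^\top(x_1 - x_2))$ over draws of $\theta$, whose expectation is exactly the Gaussian kernel.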
In terms of algorithm and implementation, the RF model improves the quality of approximation and reduces time and space requirements compared with traditional kernel approximation methods (Liu et al., 2020). This is because the RF model maps features into a new space where the dot product approximates the kernel accurately, thus improving the quality of the approximation (Yu et al., 2016). Compared to other kernel methods that map $x$ to a high-dimensional space, RF uses a randomized feature map to map $x$ to a low-dimensional Euclidean inner product space. Consequently, we can simply use linear learning methods to approximate the result of the nonlinear kernel machine (Rahimi & Recht, 2007), which saves computation time and reduces computational complexity. Also, unlike Nyström methods or other data-dependent methods, RF is a typical data-independent method with an explicit feature mapping. Data independence implies that RF does not need large samples to guarantee its approximation property (Liu et al., 2020). However, it still fails to provide satisfactory performance for complex tasks because it represents only a simple stationary kernel, while sampling $\theta$ from a mixture distribution would bring in extra computational complexity (Avron et al., 2017). On the other hand, some recent work combines kernel methods and neural networks, known as kernel deep learning (KDL), to overcome the limitation of locality (Xue et al., 2019), adopting the kernel trick to make computation tractable. In particular, KDL methods incorporate neural networks into kernel functions, i.e., $k(g(x, \theta), g(x', \theta))$, where $g(x, \theta)$ is a non-linear mapping given by a deep architecture. KDL trains a deep architecture $g(\cdot\,; \theta)$ indexed by finitely many fixed parameters and then plugs it into a simple kernel function such as a Gaussian kernel.
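The KDL construction $k(g(x, \theta), g(x', \theta))$ can be sketched as below. The two-layer mapping $g$, its width, and its weights are purely illustrative assumptions (a real KDL model would train them jointly with the kernel machine); only the overall structure, a deep embedding fed into a Gaussian kernel, comes from the description above:

```python
import numpy as np

rng = np.random.default_rng(2)
p, h, d = 10, 32, 4              # input dim, hidden width, embedding dim

# Randomly initialized deep mapping g(.; theta); in KDL these finitely many
# weights would be trained, here they just illustrate the architecture.
W1 = rng.standard_normal((h, p)) / np.sqrt(p)
W2 = rng.standard_normal((d, h)) / np.sqrt(h)

def g(x):
    # two-layer network embedding of the input
    return W2 @ np.tanh(W1 @ x)

def kdl_kernel(x1, x2, bandwidth=1.0):
    # simple Gaussian kernel evaluated on the learned embeddings
    diff = g(x1) - g(x2)
    return np.exp(-diff @ diff / (2 * bandwidth ** 2))

x1, x2 = rng.standard_normal(p), rng.standard_normal(p)
print(kdl_kernel(x1, x2))        # a value in (0, 1]
```

Because the Gaussian kernel acts on the embeddings $g(x)$ rather than on raw inputs, the composite kernel is no longer stationary in $x$, which is how KDL escapes the locality limitation discussed above.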
In this way, KDL adaptively estimates basis functions with finitely many parameters at the price of requiring substantial hand tuning (lacking a principled framework to guide parameter choices), so a large sample size is needed. In this paper, following a similar spirit to KDL, we develop a novel supervised RF method (SRF) that overcomes the local kernel's limitation by first adaptively estimating basis functions through (approximately) infinite tuning-free kernel techniques based on low-dimensional variables of the form $\langle x, \theta\rangle$, with $\theta$ drawn from a simple distribution, and then adaptively estimating the corresponding weights and the unknown link in a supervised way. Most importantly, by incorporating information from the outcome to learn the basis functions, the proposed SRF achieves excellent predictive performance with limited data, in addition to being interpretable and free of hand tuning. It is worth noting that the standard RF model has only a single layer, which may not fully express the complexity of the data. Instead, SRF includes two layers, which gives it stronger expressive power. Moreover, unlike KDL, which introduces the information of $y$ only at the last layer, SRF incorporates the information of $y$ at each layer, leading to higher predictive power without many layers. This idea is very similar to that of conditional variational autoencoders (CVAE), which are also known for good performance with limited data and for being energy efficient (Kingma & Welling, 2013; Sohn et al., 2015); energy efficiency is likewise an important aspect of the SRF approach. Compared to CVAE, the proposed SRF method is easier to interpret thanks to its flexible semi-parametric structure. The proposed SRF makes the following contributions. First, computational simplicity.
Conventional RF approaches, including training the random features in implicit kernel learning (Li et al., 2019a), choosing random features via kernel alignment (Sinha & Duchi, 2016; Cortes et al., 2010), and choosing random features by score functions in the kernel polarization method (Shahrampour et al., 2018), among others, require a heavy computational burden. Instead, the SRF model generates the random features from a simple pre-specified distribution. In comparison

