SUPERVISED RANDOM FEATURE REGRESSION VIA PROJECTION PURSUIT

Abstract

Random feature methods and neural network models are two popular nonparametric modeling approaches, often regarded as representatives of shallow learning and deep learning, respectively. In practice, random feature methods lack the capacity for feature learning, while neural network methods are computationally heavy. This paper proposes a flexible yet computationally efficient method for general nonparametric problems. Precisely, the proposed method is a feed-forward two-layer nonparametric estimator: the first layer learns a univariate basis function for each projection variable and then searches for the optimal linear combination within each group of these learnt functions; based on all the features derived in the first layer, the second layer learns a single-index function with an unknown link. Our nonparametric estimator takes advantage of both random features and neural networks, and can be seen as a bridge between them.

1. INTRODUCTION

Kernel methods are among the most powerful tools for nonlinear statistical learning problems, owing to their well-developed statistical theory and flexible modeling framework. Building on randomized algorithms for approximating kernel matrices, random feature (RF) models have attracted increasing attention because they substantially reduce the hand tuning required from the user during training, while achieving similar or better prediction accuracy than neural network models when the data size is limited (Du et al., 2022; Zhen et al., 2020). The RF model can be traced back to the work of Rahimi & Recht (2007), and was further developed by Li et al. (2019b). To be specific, for observations $(y_i, x_i)_{i=1}^n$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, RF models predict $y$ through a linear combination over a set of pre-specified nonlinear functions on a relatively low-dimensional randomized feature space. That is,
$$y_i = f(x_i) + \varepsilon_i := \sum_{j=1}^{N} \alpha_j\, \sigma\big(\langle x_i, \theta_j\rangle/\sqrt{p}\big) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$
where $N \to \infty$, $\langle \alpha, x\rangle = \sum_{j=1}^{p} \alpha_j x_j$, and $\sigma(\cdot)$ is a pre-specified function, such as the ReLU or the sigmoid function. Here, $\theta_j$ is drawn randomly from a pre-specified distribution, say the uniform distribution on a sphere, i.e., $\theta_j \sim \mathrm{Unif}(S^{p-1}(\sqrt{p}))$, where $S^{d-1}(r)$ denotes the sphere of radius $r$ in $d$ dimensions and $r = \sqrt{d}$. Model equation 1 involves unknown parameters $\alpha_j$, $j = 1, \dots, N$, only. The coefficients $\alpha$ in the RF model can be estimated by the following ridge regression:
$$\hat\alpha(\lambda) = \arg\min_{\alpha \in \mathbb{R}^N} \left\{ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{N}\alpha_j\,\sigma(\langle \theta_j, x_i\rangle)\Big)^2 + \frac{N\lambda}{p}\,\|\alpha\|_2^2 \right\}. \qquad (2)$$
Let $\mathcal{F}_{RF}(\Theta) = \big\{ f(x) = \sum_{i=1}^{N}\alpha_i\,\sigma(\langle \theta_i, x\rangle) : \alpha_i \in \mathbb{R}\ \forall i \le N \big\}$, where $\Theta \in \mathbb{R}^{N\times p}$ is the matrix whose $i$-th row is the vector $\theta_i$. When the number of random features, $N$, goes to infinity, under a suitable bound on the $\ell_2$ norm of the coefficients, $\mathcal{F}_{RF}$ reduces to a certain Reproducing Kernel Hilbert Space (RKHS) (Liu et al., 2020).
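As an illustration of equations 1 and 2, the following minimal numpy sketch draws directions uniformly from $S^{p-1}(\sqrt{p})$, forms ReLU random features, and solves the ridge problem in closed form. The toy data and all parameter values here are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N, lam = 200, 20, 500, 0.01

# toy data from a smooth target (hypothetical example)
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# random directions theta_j ~ Unif(S^{p-1}(sqrt(p))): normalize Gaussians, scale by sqrt(p)
Theta = rng.standard_normal((N, p))
Theta *= np.sqrt(p) / np.linalg.norm(Theta, axis=1, keepdims=True)

# random features sigma(<x, theta_j>/sqrt(p)) with ReLU activation, as in equation 1
Z = np.maximum(X @ Theta.T / np.sqrt(p), 0.0)          # shape (n, N)

# ridge estimate from equation 2: alpha = (Z'Z/n + (N*lam/p) I)^{-1} Z'y/n
A = Z.T @ Z / n + (N * lam / p) * np.eye(N)
alpha = np.linalg.solve(A, Z.T @ y / n)
y_hat = Z @ alpha
print(y_hat.shape)  # (200,)
```

The closed-form solve is feasible here because $N$ is modest; for the very large $N$ used later in the paper, iterative solvers or the block strategy of Section 2 would be needed.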
Specifically, the ridge regression over this function class converges to kernel ridge regression (KRR) with respect to the kernel
$$H^{RF}_p(x_1, x_2) := h^{RF}_p\!\left(\frac{\langle x_1, x_2\rangle}{p}\right) = \mathbb{E}\big[\sigma(\langle\theta, x_1\rangle)\,\sigma(\langle\theta, x_2\rangle)\big], \qquad (3)$$
where the expectation is with respect to $\theta$. Clearly, different distributions generating $\theta_j$ and different activation functions induce different RKHSs. For example, when $\theta$ follows a standard multivariate normal distribution and the activation is the ReLU $\sigma(x) = \max(0, x)$, the kernel corresponds to the first-order arc-cosine kernel. As another example, if the activation function is $\sigma(x) = [\cos(x), \sin(x)]^\top$, the kernel corresponds to the Gaussian kernel (Rahimi & Recht, 2007; Liu et al., 2020). According to Bochner's theorem, the spectral distribution $\mu_k$ of a stationary kernel $k$ is the finite measure induced by its Fourier transform, i.e., $k(x - x') = \int \exp\big(i\,\theta^\top(x - x')\big)\, \mu_k(d\theta)$. However, it is known that the chosen distribution and activation function may suffer from misspecification of the function space, leading to inefficient or even incorrect estimation (Sinha & Duchi, 2016; Derakhshani et al., 2021). Note that a general kernel $k(x, x')$ depends on the distance $x - x'$, which converges to a constant quickly as the dimension increases (Liu et al., 2020). This kind of locality, stemming from stationarity and monotonicity, prevents such kernels from revealing richer information in the feature space, which largely restricts the performance of kernel methods on complex tasks (Xue et al., 2019). RF models alleviate this issue through the introduction of the coefficients $\theta$ and their associated spectral distribution. Specifically, the RF model learns a kernel function based on a fixed activation function $\sigma(\cdot)$ indexed by (approximately) infinitely many random parameters drawn from a pre-specified distribution.
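To make the kernel correspondence concrete, the following sketch numerically checks that the $[\cos, \sin]$ feature map with standard Gaussian frequencies approximates the Gaussian kernel $\exp(-\|x_1 - x_2\|^2/2)$; the dimensions and feature count are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 5, 20000

# spectral distribution of the Gaussian kernel exp(-||x - x'||^2 / 2) is N(0, I_p)
Theta = rng.standard_normal((N, p))

def features(x):
    # [cos, sin] feature map; the 1/sqrt(N) scaling makes the dot product
    # an average of cos(theta_j . (x1 - x2)) over the N random frequencies
    t = Theta @ x
    return np.concatenate([np.cos(t), np.sin(t)]) / np.sqrt(N)

x1, x2 = rng.standard_normal(p), rng.standard_normal(p)
approx = features(x1) @ features(x2)
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2)
print(abs(approx - exact))  # Monte Carlo error, typically on the order of 1/sqrt(N)
```

The identity $\cos a \cos b + \sin a \sin b = \cos(a - b)$ is what turns the dot product of feature maps into a Monte Carlo estimate of the Bochner integral.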
In terms of algorithm and implementation, the RF model improves the quality of approximation and reduces time and space requirements compared with traditional kernel approximation methods (Liu et al., 2020). This is because the RF model maps features into a new space in which the dot product approximates the kernel accurately, thus improving the quality of the approximation (Yu et al., 2016). Compared with other kernel methods that map $x$ into a high-dimensional space, RF uses a randomized feature map to send $x$ into a low-dimensional Euclidean inner-product space. Consequently, simple linear learning methods can approximate the result of the nonlinear kernel machine (Rahimi & Recht, 2007), which saves computation time and reduces computational complexity. Also, unlike Nyström methods and other data-dependent methods, RF is a typical data-independent method with an explicit feature map. Data independence implies that RF does not need large samples to guarantee its approximation property (Liu et al., 2020). However, RF still fails to provide satisfactory performance on complex tasks because it represents only a simple stationary kernel; on the other hand, sampling $\theta$ from a mixture distribution would bring in extra computational complexity (Avron et al., 2017). Recently, some work has been done via kernel deep learning (KDL), a combination of kernel methods and deep neural networks, to overcome the limitation of locality (Xue et al., 2019), adopting the kernel trick to make computation tractable. In particular, KDL methods incorporate deep networks into kernel functions, i.e., $k(g(x, \theta), g(x', \theta))$, where $g(x, \theta)$ is a nonlinear mapping given by a deep architecture. KDL trains a deep architecture $g(\cdot\,; \theta)$ indexed by finitely many fixed parameters and then plugs it into a simple kernel function such as a Gaussian kernel.
In this way, KDL adaptively estimates basis functions with finitely many parameters, at the price of substantial hand tuning (lacking a principled framework to guide parameter choices), and hence a large sample size is needed. In this paper, following a similar spirit to KDL, we develop a novel supervised RF method (SRF) to overcome the local kernel's limitation: we first adaptively estimate basis functions through (approximately) infinitely many tuning-free kernel techniques based on the low-dimensional variables $\langle x, \theta\rangle$, with $\theta$ drawn from a simple distribution, and then adaptively estimate the corresponding weights and the unknown link in a supervised way. Most importantly, by incorporating information from the outcome when learning the basis functions, the proposed SRF attains excellent predictive performance with limited data, in addition to being interpretable and free of hand tuning. It is worth noting that standard RF has only a single layer, which may not fully express the complexity of the data. In contrast, SRF includes two layers, giving it stronger expressive power. Moreover, unlike KDL, which introduces the information of $y$ only at the last layer, SRF incorporates the information of $y$ at each layer, leading to higher predictive power without requiring many layers. This idea is similar in spirit to Conditional Variational Autoencoders (CVAE), which are also known for good performance on limited data and for being energy efficient (Kingma & Welling, 2013; Sohn et al., 2015). Energy efficiency is an important aspect of the SRF approach. Compared with CVAE, the proposed SRF enjoys easier interpretation through its flexible semi-parametric structure. The proposed SRF makes the following contributions. First, computational simplicity.
Conventional RF variants, including training the random features in implicit kernel learning (Li et al., 2019a), choosing random features via kernel alignment (Sinha & Duchi, 2016; Cortes et al., 2010), and choosing random features by score functions in the kernel polarization method (Shahrampour et al., 2018), among others, incur a heavy computational burden. Instead, the SRF model generates the random features from a simple pre-specified distribution. For comparison, a single-hidden-layer neural network (NN-1, Rumelhart et al., 1986) takes the form $f(x) = \sum_{j=1}^{k} \sigma(w_j^\top x + b_j)$, where $k$ is the number of units in the hidden layer; NN-1 requires estimating $pk$ parameters $\{w_j\}_{j=1}^k$, while RF models estimate only $N$ linear coefficients $\{\alpha_j\}_{j=1}^N$. Projection pursuit regression (PPR, Friedman & Stuetzle, 1981) combines GAM and NN-1 by estimating the nonlinear functions $f_j$ and the projection directions $w_j$ simultaneously, that is, $f(x) = \sum_{j=1}^{k} f_j(w_j^\top x)$, which requires extensive computation when $p$ and/or $k$ are large. Furthermore, a large $N$ is usually required to obtain a good approximation of the function space; however, when $N$ is large, directly estimating the combination coefficients of the supervised random features via the ridge regression equation 2 is computationally burdensome. The proposed SRF therefore divides all random features into $K \ll N$ blocks. For each block, ridge regression is used to obtain an initial prediction of the outcome $y$; then PPR is applied to the resulting low-dimensional ($K$) predictors to obtain the final prediction. This step further improves prediction accuracy by adaptively estimating the combination scheme, in addition to saving computation time, in a scalable way, by avoiding a single high-dimensional ridge regression. Second, model flexibility and automatic calibration (Wilson et al., 2016).
Similarly to generalized additive models (GAM, Hastie, 2017), i.e., $f(x) = \sum_{j=1}^{p} f_j(x_j)$, RF models overcome the curse of dimensionality by mapping the $p$-dimensional covariates into one-dimensional random features $\langle\theta, x\rangle$. Different from GAM, the RF model can capture interactions between covariates through the projection direction $\theta$. The proposed SRF estimates the activation function for each random feature in a supervised way, which avoids any subjective pre-specified fixed kernel space. It adaptively estimates each function and thus allows a different function space for each random feature. Therefore, the proposed SRF accommodates a richer function space on the variables $x$ without knowing the true space they belong to. Consequently, the proposed SRF model achieves more stable prediction errors than conventional RF models. Third, model simplicity. Unlike multi-layer neural networks, the SRF model needs only two layers to achieve good prediction accuracy. As described in the following section, the first layer produces 'nonparametric' random features through nonparametric regression, and the second layer is projection pursuit, a universal approximator that can theoretically approximate any continuous function on $\mathbb{R}^p$, which is extremely useful for regression forecasting due to its semi-parametric structure. More importantly, the estimation of these two layers can be carried out with common statistical methods, without extensive manual tuning from the user. Finally, model interpretability. As is well known, neural networks lack a principled framework for choosing parameters such as the architecture, activation functions, or optimizer (Wilson et al., 2016). This, combined with the non-identifiability of their parameters, renders neural networks hard to interpret. Fortunately, our SRF model enjoys good interpretability to some extent.
For instance, as mentioned above, the RF model uses linear learning methods in place of nonlinear kernel methods. The biggest advantage of linear learning methods is the interpretability of their coefficients: significant coefficients imply important directions $\langle\theta, x\rangle$ (Liu et al., 2020), which facilitates interpreting and understanding the underlying important features. The rest of the paper is organized as follows. Section 2 introduces the proposed SRF in detail, together with its algorithm. Section 3 compares the proposed SRF method with other statistical methods on various types of simulated data. Section 4 considers five real-world data (RWD) examples to evaluate the performance of the proposed SRF method. Section 5 concludes the paper with remarks.

2. SUPERVISED RANDOM FEATURE

Consider the problem $Y_i = f_0(x_i) + \varepsilon_i$, where $x_i \in \mathbb{R}^p$ is a $p$-dimensional vector and the function $f_0$ is unknown. The random errors $\varepsilon_i$, $1 \le i \le n$, are independent of each other and of $x_i$, with $\mathbb{E}(\varepsilon_i) = 0$ and $\mathbb{E}(\varepsilon_i^2) = \sigma^2 < \infty$. When the dimension $p$ is larger than 3, nonparametric regression suffers from the curse of dimensionality. We now introduce the proposed supervised random feature model, denoted SRF. First, for each random feature $\langle\theta_j, x\rangle$, we compute its prediction of the outcome $Y$. That is,
$$Y_i = f_j(\langle\theta_j, x_i\rangle) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (4)$$
where $f_j(\cdot)$ is an unknown univariate nonparametric function. Denote its estimator by $\hat f_j$, an initial prediction, which can be obtained easily with any nonparametric tool, such as $k$-nearest-neighbor (KNN) regression, or kernel regression from the Python package statsmodels.nonparametric. It is worth pointing out that for each RF $\langle\theta_j, x\rangle$, we estimate the activation function in a supervised way. By doing so, first, we avoid the misspecification issue on the kernel space. Second, this adaptive treatment of the kernel space relaxes the restriction on the distribution of the random index parameter $\theta_j$: we can simply sample $\theta_j$ from a unit ball, and then adaptively estimate the corresponding activation function with the outcome information incorporated. Third, the underlying kernel space need not be the same for each RF; with each function estimated independently, we in effect obtain a multi-kernel mixed space, which largely improves predictive power over a single-kernel space, especially for complex tasks. Next, we refine the prediction in an aggregated way by minimizing the following ridge-type objective, similar to conventional RF models:
$$\frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{N}\alpha_j\,\hat f_j(\langle\theta_j, x_i\rangle)\Big)^2 + \frac{N\lambda}{p}\,\|\alpha\|_2^2. \qquad (5)$$
Denote the resulting prediction by $\hat f_{SRF\text{-}I}(x_i) := \sum_{j=1}^{N}\hat\alpha_j\,\hat f_j(\langle\theta_j, x_i\rangle)$.
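A minimal sketch of the first layer (supervised random features) follows, using a hand-rolled one-dimensional KNN smoother in place of statsmodels so the code is self-contained. The sizes, the toy target, and keeping $N$ small are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N, k = 300, 10, 8, 15   # N kept small for illustration; the paper uses very large N

X = rng.standard_normal((n, p))
y = np.cos(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# random directions on the sphere of radius sqrt(p)
Theta = rng.standard_normal((N, p))
Theta *= np.sqrt(p) / np.linalg.norm(Theta, axis=1, keepdims=True)

def knn_fit_1d(u, y, k):
    # returns f_hat: a 1-d k-nearest-neighbor regression of y on the projection u
    def f_hat(u_new):
        u_new = np.atleast_1d(u_new)
        return np.array([y[np.argsort(np.abs(u - v))[:k]].mean() for v in u_new])
    return f_hat

# supervised random features: one univariate fit per projection <theta_j, x>
projections = X @ Theta.T / np.sqrt(p)        # shape (n, N)
f_hats = [knn_fit_1d(projections[:, j], y, k) for j in range(N)]
F = np.column_stack([f(projections[:, j]) for j, f in enumerate(f_hats)])
print(F.shape)  # (300, 8)
```

Each column of `F` is one supervised random feature $\hat f_j(\langle\theta_j, x_i\rangle)$; these columns are the inputs to the ridge-type aggregation in equation 5.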
By treating each initial prediction $\hat f_j$ as a candidate model, the SRF method shares a similar idea with stacking methods in the model-averaging literature (Yao et al., 2018): we aggregate the predictions $\hat f_j$ through weights $\alpha_j$ obtained by minimizing a least-squares-type criterion. Clearly, the weights $\alpha_j$ can be positive or negative. Unlike the all-positive weights of conventional model-averaging methods, allowing both signs improves predictive power, especially when the candidate models do not cover the underlying true model (Arce, 1998). Different from stacking methods, SRF involves random features $\theta_j$, which leads to a clear identification of important features, whose corresponding coefficients $\alpha_j$ are usually large, as shown in Part three of the Simulation section. To avoid the computational burden of large $N$, we further divide the $N$ features into $K$ blocks, of equal size without loss of generality. Within each block, we solve equation 5 to obtain a raw prediction $\hat f^{(1)}_k(x_i) = \sum_{j=1}^{N_k} \hat\alpha_{kj}\,\hat f_j(\langle\theta_j, x_i\rangle)$. Based on the $K$ predictors $\hat f^{(1)}_k(\cdot)$, $k = 1, \dots, K$, we obtain a further refined prediction by minimizing the objective
$$\frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - g\Big(\sum_{k=1}^{K}\beta^{(1)}_k\,\hat f^{(1)}_k(x_i)\Big)\Big)^2, \qquad (6)$$
where $g$ is an unknown nonparametric link function. This step further improves prediction accuracy through the nonparametric aggregation link $g$ and the additional weight parameters $\beta_k$. Specifically, the nonparametric link $g$ extracts interaction information across features, and the product terms $\alpha_{kj} \times \beta_k$ can extract hierarchical information from each feature, similar to a two-layer NN. Different from the two-layer NN, which pre-specifies an activation function and the final link, the proposed SRF estimates each activation function $\hat f_j$ and the final link $g$ with the outcome information incorporated.
Thus, SRF attains higher predictive power thanks to the multi-kernel mixed space and the flexible, nonparametric expression of interactions, all learned in a supervised way. The estimator in equation 6 can be obtained by PPR. The final prediction is defined as $\hat f_{SRF\text{-}II}(x_i) = \hat g\big(\sum_{k=1}^{K}\hat\beta^{(1)}_k\,\hat f^{(1)}_k(x_i)\big)$. The entire procedure is given in Algorithm 1.

Algorithm 1: Algorithm for SRF-II
1: Input: $\{y_i, x_i\}_{i=1}^{n}$, $N$, $K$.
2: Randomly generate $N$ directions $\theta_j$, $j = 1, \dots, N$: $\theta_j \sim \mathrm{Unif}(S^{p-1}(\sqrt{p}))$.
3: Obtain supervised random features $\hat f_j$, $j = 1, \dots, N$, using equation 4.
4: Obtain $K$ initial raw estimators $\hat f^{(1)}_k$, $k = 1, \dots, K$, by minimizing equation 5 within each block.
5: Obtain $\hat f(x) = \hat g\big(\sum_{k=1}^{K}\hat\beta^{(1)}_k\,\hat f^{(1)}_k(x)\big)$ by minimizing the objective equation 6.
6: Output: $\hat f$.
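The block-wise aggregation in steps 4-5 of Algorithm 1 can be sketched as follows. Here the first-layer features are replaced by a random stand-in matrix, and the PPR step is simplified to a single pass ($\beta$ by least squares, then $g$ by a 1-d KNN smoother), so this illustrates the structure rather than reproducing the authors' exact estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, K, lam = 300, 40, 4, 0.1

# stand-in for the N first-layer supervised features f_hat_j(<theta_j, x_i>)
F = rng.standard_normal((n, N))
y = F[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(n)

# step 4: block-wise ridge regression -> K raw predictors f^(1)_k
blocks = np.array_split(np.arange(N), K)
raw = np.empty((n, K))
for j, idx in enumerate(blocks):
    Fk = F[:, idx]
    a = np.linalg.solve(Fk.T @ Fk / n + lam * np.eye(len(idx)), Fk.T @ y / n)
    raw[:, j] = Fk @ a

# step 5: one PPR pass -- beta by least squares, then g by 1-d KNN smoothing of the index
beta, *_ = np.linalg.lstsq(raw, y, rcond=None)
index = raw @ beta

def g_hat(t, k=15):
    return np.array([y[np.argsort(np.abs(index - v))[:k]].mean() for v in np.atleast_1d(t)])

y_hat = g_hat(index)
print(np.mean((y_hat - y) ** 2) < np.var(y))  # True
```

A full PPR fit would alternate between updating $\beta$ and re-estimating $g$ until convergence; the single pass above is enough to show how the $K$ block predictors feed the final single-index link.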

3. SIMULATION STUDIES

This section evaluates the performance of the proposed SRF method on various types of simulated data. We compare the prediction results with other statistical methods, including basic random feature regression (Relu-I), one-layer kernel regression (SRF-I), advanced basic random feature regression (Relu-II), two-layer kernel regression with projection pursuit (SRF-II), Random Forest, a one-layer neural network (NN-1), and a two-layer neural network (NN-2). We consider four settings for the regression function $f_0(\cdot)$: (a) Linear: $f_0(X) = 2X_1 + X_2 + 3X_3$; (b) Composite: $f_0(X) = \cos(X_1) + X_2^2 + e^{X_2/3} + X_3 X_4 X_5 + \cos(X_5) + 2X_6 + X_7^2 + X_8 X_9 + X_{10}$; (c) Nonlinear: $f_0(X) = (X_1 + X_2 + X_3)^2 + 1$; (d) More complex: $f_0(X) = \cos(X_1) + 2X_1 X_2 + X_3^2 + \sin(X_4) + \exp(X_5) + \exp(X_6 + X_7) + \cos(X_8 + X_9 + X_{10}^2) + \sin(X_1 + X_5)^2$. Here, $X_i$ denotes the $i$-th coordinate of $X \in \mathbb{R}^p$. Under all settings, we generate $n = 300$ observations with $p = 100$ covariates. The covariates $X$ are drawn from a multivariate normal distribution $N(0, \Sigma)$ with three correlation structures: (I) independence, i.e., $\Sigma = I$; (II) fixed correlation, i.e., all off-diagonal components of $\Sigma$ equal 0.5; (III) random correlation, i.e., each off-diagonal component of $\Sigma$ is generated from a uniform distribution $\mathrm{Unif}(-1, 1)$. The random error in the regression $Y = f_0(X) + \varepsilon$ is generated from a normal distribution, $\varepsilon \sim N(0, 0.1)$. Each simulation is replicated 100 times. To determine the number of random features $N$, extensive simulations show that with $N = 12000$ for independent covariates and $N = 24000$ for correlated covariates, the prediction accuracy stabilizes and larger $N$ brings no significant improvement, as shown in Figure 4 (see Appendix A). Thus, for computational simplicity, we take $N = 12000$ and $N = 24000$, respectively.
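The simulated data for, e.g., setting (c) under correlation structure (II) can be generated as follows. We read $\varepsilon \sim N(0, 0.1)$ as noise variance 0.1, which is an assumption on our part since the paper does not say whether 0.1 is the variance or the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 100

# correlation structure (II): all off-diagonal entries of Sigma equal 0.5
Sigma = np.full((p, p), 0.5)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# setting (c): f0(X) = (X1 + X2 + X3)^2 + 1, with Gaussian noise (variance 0.1 assumed)
f0 = (X[:, 0] + X[:, 1] + X[:, 2]) ** 2 + 1
y = f0 + rng.normal(0.0, np.sqrt(0.1), size=n)
print(X.shape, y.shape)
```

Note that for structure (III), off-diagonal entries drawn independently from $\mathrm{Unif}(-1, 1)$ need not yield a positive semi-definite $\Sigma$, so a projection to the nearest correlation matrix would be required in practice; the equicorrelated $\Sigma$ above is always valid.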
The larger $N$ required for correlated covariates is understandable: randomly generated RFs may themselves be correlated and carry overlapping information when the covariates are correlated, so a larger $N$ is needed to capture the covariate information thoroughly. The regularization parameter $\lambda$ is determined by the model complexity; essentially, to ensure the stability of the model and the accuracy of the estimation, a larger $\lambda$ is used as the model complexity increases. We compare prediction performance in terms of the predicted mean squared error (MSE), $\frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i)^2$, and the Scaled MSE, $\frac{1}{n}\sum_{i=1}^{n}\big(\frac{\hat y_i - y_i}{y_i/2}\big)^2$, on additional test data of size 100. To keep the visualizations readable, we exclude 5-10% outliers; the factor 2 in the denominator of the Scaled MSE amplifies the results and makes differences easier to see. The results for models (a)-(d) under independent covariates (I) are shown in Table 1 and Figure 1. In particular, Table 1 summarizes the average Scaled MSE and MSE over 100 replications, and Figure 1 shows box plots of the Scaled MSE. From Table 1 and Figure 1 we see that when the model is simple, as in the linear scenario (a), two-layer methods do not achieve better prediction accuracy than their one-layer counterparts. Interestingly, Scaled MSE and MSE favor different models: SRF-I, SRF-II, and NN-1 have smaller Scaled MSEs, while Relu-II and Random Forest have smaller MSEs. The likely reason is that the linear model is too simple, and kernel regression and projection pursuit are prone to over-fitting. SRF shows its advantage as model complexity increases. For models (b) and (c), SRF-I and SRF-II outperform the others in terms of both Scaled MSE and MSE. Also, owing to the complexity of the composite function, all two-layer models perform better than their one-layer counterparts.
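The two error metrics can be written down directly; the toy vectors below are our own.

```python
import numpy as np

def mse(y_hat, y):
    # predicted mean squared error
    return np.mean((y_hat - y) ** 2)

def scaled_mse(y_hat, y):
    # the factor 2 in the denominator amplifies differences for visualization
    return np.mean(((y_hat - y) / (y / 2)) ** 2)

y = np.array([2.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.0, 5.0])
print(mse(y_hat, y))         # 1.25/3 ~ 0.4167
print(scaled_mse(y_hat, y))  # 0.5/3  ~ 0.1667
```

Note the Scaled MSE is undefined when some $y_i = 0$, so it is best suited to targets bounded away from zero, as in the simulation settings here.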
However, the neural networks do not show an advantage under models (b) and (c). For the more complex model (d), the neural networks work well, while SRF-I and SRF-II beat Relu-I and Relu-II in terms of Scaled MSE; Random Forest remains stable. It is worth noting that under this scenario, the two-layer models have significantly better and more stable prediction performance than the one-layer models, except for the neural networks. To examine interpretability, we consider the following two criteria under model (c). For the first criterion, we calculate the difference in the sum of absolute weights $\omega$ on the three important covariates between the maximum absolute coefficient $\alpha_{\max} = \max_j |\alpha_j|$ and the minimum $\alpha_{\min} = \min_j |\alpha_j|$. Ideally, a large $|\alpha_j|$ indicates a more important feature. We then compute the proportion of differences larger than 0 out of 50 replicates, for 20 $\alpha$'s at a time, termed Maxmin. For comparison, we also compute the difference between a randomly chosen $\alpha_j$ and $\alpha_{\min}$, termed Ranmin. For the second criterion, we compare the significant elements of $\omega$ (the first three elements) with non-significant ones. For comparison, we consider three non-significant elements of $\omega$ at pre-specified fixed positions (Fixpos) or randomly chosen positions (Ranpos). We also compare three randomly chosen covariates against other randomly chosen covariates (Ranran). Similarly, we calculate the proportions of differences larger than 0 out of 50 replications, for 20 $\alpha$'s at a time. The results are shown in Figure 2. From Figure 2 we see that Ranpos and Fixpos have significantly larger proportions than Ranran, up to a 56% improvement on average, indicating that important features do have larger values of $\omega$. Maxmin also has a significantly larger proportion than Ranmin, up to an 80% improvement on average, implying that a larger $\alpha_j$ does represent an important direction.
Therefore, SRF-II can meaningfully identify important directions through large values of $\alpha_j$.

4. REAL DATA EXAMPLES

List of datasets. Our real data experiments consider the following datasets, all publicly available. More details about these datasets, including data size and number of features, are provided in Table 2. • Abalone was collected from the UCI (University of California, Irvine) Machine Learning Repository, with data size n = 800. The objective is to predict the age of an abalone (number of rings) from individual measurements. It contains seven features: length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight. • Boston was collected from the scikit-learn repository, with data size n = 478. This dataset concerns Boston house prices and is one of the most famous regression benchmarks. It contains thirteen features: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, and LSTAT (see Appendix C for details). The objective is to predict how these features affect the house price MEDV (median value of owner-occupied homes in $1000's). • Wine was also collected from the UCI Machine Learning Repository, with data size n = 1000. The white wine quality dataset contains eleven features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. These eleven independent variables are used to predict the quality (based on sensory data) of each white wine. • Auto MPG was collected from Kaggle.com. This dataset records the fuel consumption of n = 393 automobiles in miles per gallon, with three multi-valued discrete attributes and five continuous attributes. The eight attributes are MPG (miles per gallon), number of engine cylinders, engine displacement, horsepower, vehicle weight, acceleration, model year, and origin; the objective is to predict MPG from the other seven features. • Song Popularity was collected from Kaggle.com.
Recently, there has been increasing research into the relationship between a song's popularity and its characteristics; the main goal is to predict a song's popularity from several factors. This dataset collects thirteen factors: song duration, acousticness (electronic music or not, 0 to 1), danceability, energy, instrumentalness (pure music or not, 0 to 1), key, liveness, loudness, audio mode, speechiness, tempo, time signature, and audio valence (positive or negative psychological feeling, 0 to 1). For ease of illustration, only the data with key 4 is considered, leading to a data size of n = 1307. In preprocessing, min-max normalization of each continuous variable is applied to all datasets except Auto MPG and Song Popularity, for which z-score normalization is used. Prediction results. All predicted MSEs and Scaled MSEs (with standard deviations) are reported in Table 2. It can be seen from Table 2 that SRF-II shows the best performance in terms of both Scaled MSE and MSE, with reductions relative to Relu-I ranging over 9-34% and 2-28%, respectively, across all datasets. With limited data, NN-2 and NN-1 suffer from instability. In particular, for the Abalone dataset, conventional RF models and the proposed SRF perform comparably, and the two-layer models show no advantage over their one-layer counterparts, probably due to the underlying simple structure of the data. For the other four datasets, the two-layer models have smaller prediction errors than their one-layer counterparts. Interpretability results. The simulation section confirmed that SRF-II can meaningfully identify important directions through large values of $\alpha_j$. Thus, for the real data examples, we identify significant variables according to the magnitude of the absolute values of $\omega$; in particular, we identify them by aggregating rankings over 50 runs for each dataset.
The proportion of each variable appearing in the top few ranks (depending on the number of features) in each dataset is reported in Figure 3. • Abalone Due to the nature of this dataset, there are no clearly significant or non-significant variables; every variable has some effect on abalone age. • Boston RM, LSTAT, and B are the three significant variables in this dataset, while the significance of INDUS and AGE is moderate. According to analyses on github.com, LSTAT and RM have the largest correlation coefficients with MEDV: buyers tend to prefer a lower proportion of lower-status residents around their houses, and more rooms imply a bigger house. Interestingly, we also find that B is an important factor for MEDV, unlike what others have reported. The remaining variables have less obvious effects on MEDV. • Wine In this dataset, alcohol, volatile acidity, and chlorides are relatively significant compared to the others. Because the best quality is achieved by a balance of all variables, there are no strikingly dominant variables in this dataset. First, it is often said that the higher the alcohol content, the better the wine. Second, too much volatile acidity makes a wine smell pungent. Finally, the right amount of chloride can extend the life of a wine, but too much produces an unpleasant taste. • Auto MPG For this dataset, weight and displacement are clearly significant, and model year and horsepower are moderately significant; the other variables are unlikely to affect MPG. Correlation analyses of this dataset on Kaggle.com find that weight and displacement have the largest absolute correlation coefficients with MPG, which is consistent with our results. Common sense also tells us that the heavier the car and the larger the displacement, the lower the MPG.
It is worth mentioning that model year also has an effect on MPG, as wear on older vehicles tends to degrade fuel economy. The other variables are not significant, since their correlation coefficients with weight, displacement, and horsepower are too high. • Song Popularity Our results are not exactly the same as the correlation analyses others have posted on Kaggle.com, but they are roughly consistent. Audio valence and loudness are the two significant variables in this dataset, while acousticness, danceability, and instrumentalness are moderately significant. People love songs with positive psychological feelings, so it is easy to understand that audio valence strongly influences song popularity; and since few people like overly loud songs, loudness is also a significant variable. Electronic music, dance music, and pure music each have their own audiences, so those factors have some influence on song popularity. Other variables, such as liveness or tempo, draw little attention from the audience and are therefore non-significant.

5. CONCLUDING REMARKS

This paper proposes a novel, computationally efficient method for general nonparametric problems. To the best of our knowledge, we are the first to combine the advantages of random features and neural networks into a new feed-forward two-layer nonparametric estimation method. Extensive simulations and real data show that SRF-II has excellent prediction performance and good interpretability; in particular, SRF-II reduces the prediction error relative to Relu-I by 9-34% (Scaled MSE) and 2-28% (MSE) across the five datasets. More importantly, SRF-II performs well with limited data and is thus energy efficient. Nevertheless, three limitations remain. First, for computational simplicity, we take N = 12000 and N = 24000 in our experiments; optimal choices or clear selection criteria are not yet available. Second, we consider only regression problems, not classification. Classification with RF incurs extra computational burden due to the nonlinear structure compared with least squares; nevertheless, extending SRF-II to classification, especially image classification, would be very meaningful. Finally, we have not yet considered variable selection to produce a more parsimonious model, which deserves further study to improve prediction accuracy and computational efficiency.

APPENDIX



Table 1: Scaled MSEs and MSEs for models (a)-(d) under various methods, including Relu-I, SRF-I, Relu-II, SRF-II, Random Forest, NN-1, and NN-2. Two-layer methods show no obvious advantage for the linear model. As model complexity increases, for the composite and nonlinear models, SRF-I and SRF-II perform better than the others on both Scaled MSE and MSE. For the most complex model, SRF-I and SRF-II work well on Scaled MSE, and NN-1 and NN-2 have comparable performance under the model (d) setting.

Figure 1: Box plots of Scaled MSEs for models (a)-(d) under independent covariates (I). Clearly, except under the linear model setting (a), SRF-II has the smallest Scaled MSEs under all other models.

Figure 2: The proportion of differences larger than 0 out of 50 replicates under five indexes: Maxmin, Ranmin, Fixpos, Ranpos, and Ranran. Compared with Ranran and Ranmin, the larger counts for Maxmin, Fixpos, and Ranpos show that larger α values are associated with important features.

Figure 3: The proportion of each variable appearing in the top few ranks (depending on the number of features) in each dataset. A longer color bar in the histogram represents a higher proportion, indicating a more significant feature. Abalone and Wine have no obviously significant variables; for the Boston data, RM, LSTAT, and B are the three significant variables; for Auto MPG, weight and displacement are clearly significant; and audio valence and loudness are the two significant variables for Song Popularity.

Figure 4: Box plots of the scaled predicted mean squared error under model (c), for independent covariates generated from (I) with N = 6000, 12000, 18000, and for correlated covariates generated from (III) with N = 12000, 24000, 36000.

Table 2: Prediction results, including Scaled MSEs (with standard deviations) and MSEs, for all five datasets, together with data size, number of features, and public availability. On Abalone, the proposed SRF methods show no obvious advantage; for the other four datasets, the two-layer models beat their one-layer counterparts. The proposed SRF performs best on Boston, Auto MPG, and Song Popularity, and SRF-II works particularly well on all five datasets. (Note: due to the extreme instability of the two-layer neural network, part of its error data was removed.)

B: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000's

