OPTIMAL ACTIVATION FUNCTIONS FOR THE RANDOM FEATURES REGRESSION MODEL

Abstract

The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have recently been studied. We build on this work and identify in closed form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve and the dependency of its optimal regularization parameter on the observation noise level.

1. INTRODUCTION

For many neural network (NN) architectures, the test error does not increase monotonically with a model's complexity: it can decrease together with the training error at both low and high complexity levels. This phenomenon, the double descent curve, defies intuition and has motivated new frameworks to explain it. Explanations have been advanced involving linear regression with random covariates (Belkin et al., 2020; Hastie et al., 2022), kernel regression (Belkin et al., 2019b; Liang & Rakhlin, 2020), the neural tangent kernel model (Jacot et al., 2018), and the Random Features Regression (RFR) model (Mei & Montanari, 2022). These frameworks allow queries beyond the generalization power of NNs. For example, they have been used to study networks' robustness properties (Hassani & Javanmard, 2022; Tripuraneni et al., 2021). One aspect within reach and unstudied to this day is finding optimal Activation Functions (AFs) for these models. It is known that AFs affect a network's approximation accuracy, and efforts to optimize AFs have been undertaken. Previous work has justified the choice of AFs empirically, e.g., Ramachandran et al. (2017), or provided numerical procedures to learn AF parameters, sometimes jointly with the model's parameters, e.g., Unser (2019). See Rasamoelina et al. (2020) for commonly used AFs and Appendix C for how AFs have previously been derived. We derive, for the first time, closed-form optimal AFs that minimize an explicit objective involving the asymptotic test error and sensitivity of a model. Setting aside empirical and principled-but-numerical methods, all past principled and analytical approaches to designing AFs focus on non-accuracy-related considerations, e.g., Milletarí et al. (2019). We focus on AFs for the RFR model and expand its understanding. We preview a few surprising conclusions extracted from our main results:

1. The optimal AF can be linear, in which case the RFR model is a linear model. For example, if no regularization is used during training, a linear AF is often preferred for minimizing the test error of low-complexity models, whereas a non-linear AF is often better for high-complexity models.

2. A linear optimal AF can destroy the double descent behaviour and achieve small test error with far fewer samples than, e.g., a ReLU.

3. When, apart from the test error, the sensitivity of a model becomes important, optimal AFs that were linear without sensitivity considerations can become non-linear, and vice versa.

4. Using an optimal AF with an arbitrary regularization during training can lead to the same, or better, test error as using a non-optimal AF, e.g. ReLU, with optimal regularization.
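Conclusion 2 above can be probed numerically at finite sizes. The sketch below (assuming NumPy; the dimensions, noise level, linear target, and ridge parameter are illustrative choices, not taken from the paper) fits the RFR model by ridge regression across several widths N, for both a ReLU and a linear AF. With small regularization, the ReLU curve typically exhibits a peak near the interpolation point N = n, while the linear-AF model behaves like a linear regression and can reach small error with far fewer features.

```python
import numpy as np

rng = np.random.default_rng(2)

def sphere(m, d, rng):
    """m i.i.d. samples, uniform on the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((m, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

def rfr_test_error(sigma, N, d=30, n=120, tau=0.2, lam=1e-4, rng=rng):
    """Fit the RFR model by ridge regression; return its mean squared test error."""
    beta = rng.standard_normal(d) / np.sqrt(d)   # illustrative linear target f_d
    X, X_te = sphere(n, d, rng), sphere(1000, d, rng)
    y = X @ beta + tau * rng.standard_normal(n)
    Theta = sphere(N, d, rng)                    # random first-layer weights
    Z = sigma(X @ Theta.T / np.sqrt(d))          # n x N random-feature matrix
    a = np.linalg.solve(Z.T @ Z / n + lam * np.eye(N), Z.T @ y / n)
    return np.mean((sigma(X_te @ Theta.T / np.sqrt(d)) @ a - X_te @ beta) ** 2)

relu = lambda t: np.maximum(t, 0.0)
lin = lambda t: t
widths = [20, 60, 120, 240]                      # n = 120: N = 120 is the interpolation point
errs_relu = [rfr_test_error(relu, N) for N in widths]
errs_lin = [rfr_test_error(lin, N) for N in widths]
```

Plotting `errs_relu` and `errs_lin` against `widths` gives a finite-size caricature of the curves discussed above; the asymptotic results in the paper make these comparisons precise.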

1.1. PROBLEM SET UP

We consider the effect of AFs on finding an approximation to a square-integrable function f_d on the d-dimensional sphere S^{d-1}(√d), the function having been randomly generated. The approximation is to be learnt from training data D = {(x_j, y_j)}_{j=1}^n, where x_j ∈ S^{d-1}(√d), the variables {x_j}_{j=1}^n are i.i.d. uniformly sampled from S^{d-1}(√d), and y_j = f_d(x_j) + ε_j, where the noise variables {ε_j}_{j=1}^n are i.i.d. with E(ε_j) = 0, E(ε_j²) = τ², and E(ε_j⁴) < ∞. The approximation is defined according to the RFR model. The RFR model can be viewed as a two-layer NN with random first-layer weights encoded by a matrix Θ ∈ R^{N×d} whose i-th row θ_i ∈ R^d satisfies ‖θ_i‖ = √d, with {θ_i}_{i=1}^N i.i.d. uniform on S^{d-1}(√d), and with to-be-learnt second-layer weights encoded by a vector a = [a_i]_{i=1}^N ∈ R^N. Unless specified otherwise, the norm ‖·‖ denotes the Euclidean norm. The RFR model defines f_{a,Θ}: S^{d-1}(√d) → R such that

f_{a,Θ}(x) = Σ_{i=1}^N a_i σ(⟨θ_i, x⟩/√d),   (1)

where σ(·) is the AF that is the target of our study and ⟨u, v⟩ denotes the inner product between vectors u and v. When clear from the context, we write f_{a,Θ} as f, omitting the model's parameters. The optimal weights a★ are learnt using ridge regression with regularization parameter λ ≥ 0, namely,

a★ = a★(λ, D) = argmin_{a ∈ R^N} (1/n) Σ_{j=1}^n ( y_j − Σ_{i=1}^N a_i σ(⟨θ_i, x_j⟩/√d) )² + λ‖a‖².   (2)

We will tackle this question: What is the simplest σ that leads to the best approximation of f_d? We quantify the simplicity of an AF with its norm in different functional spaces, namely, either

‖σ‖₁ ≜ E(|σ(G)|),   (3)

or

‖σ'‖₂² ≜ E((σ'(G))²),   (4)

where σ' is the derivative of σ and the expectations are with respect to a standard normal random variable, i.e. G ∼ N(0, 1). For a comment on these choices please read Appendix A. We quantify the quality with which f = f_{a★,Θ} approximates f_d via Q, a linear combination of the mean squared error and the sensitivity of f to perturbations in its input. For α ∈ [0, 1] and x uniform on S^{d-1}(√d), we define

Q ≜ (1 − α)E + αS,   (5)

where

E ≜ E((f(x) − f_d(x))²),   (6)

and

S ≜ E(‖∇_x f(x)‖²).   (7)

See Appendix B for a comment on our choice of sensitivity. Like in Mei & Montanari (2022); D'Amour et al. (2020), we operate in the asymptotic proportional regime where N, n, d → ∞ with constant ratios between them, namely, N/d → ψ₁ and n/d → ψ₂.
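The finite-size setup above can be simulated directly. The following sketch (assuming NumPy; the sizes, noise level, ReLU AF, and linear target f_d are illustrative stand-ins, not choices made in the paper) samples data uniformly on the sphere of radius √d, builds the random-feature matrix, and solves the ridge-regression training problem in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(m, d, rng):
    """m i.i.d. samples, uniform on the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((m, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

d, n, N = 50, 400, 200                   # illustrative finite sizes
tau, lam = 0.1, 1e-3                     # noise level and ridge parameter
beta = rng.standard_normal(d) / np.sqrt(d)
f_d = lambda X: X @ beta                 # illustrative target function

X = sphere(n, d, rng)                    # training inputs
y = f_d(X) + tau * rng.standard_normal(n)
Theta = sphere(N, d, rng)                # random first-layer weights
sigma = lambda t: np.maximum(t, 0.0)     # ReLU, as an example AF

Z = sigma(X @ Theta.T / np.sqrt(d))      # n x N random-feature matrix
# Closed-form minimizer of (1/n)||y - Z a||^2 + lam ||a||^2
a_star = np.linalg.solve(Z.T @ Z / n + lam * np.eye(N), Z.T @ y / n)

X_test = sphere(2000, d, rng)
test_err = np.mean((sigma(X_test @ Theta.T / np.sqrt(d)) @ a_star
                    - f_d(X_test)) ** 2)
```

The normal-equations solve is exactly the stationarity condition of the ridge objective, so no iterative optimizer is needed at this scale.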
In this asymptotic setting, it does not matter if, in defining (6) and (7), in addition to taking the expectation with respect to the test data x, independently of D, we also take expectations over D and the random features in the RFR model. This is because when N, n, d → ∞ with the ratios defined above, E and S concentrate around their means (Mei & Montanari, 2022; D'Amour et al., 2020). Mathematically, denoting by ‖·‖ either (3) or (4), our goal is to study the solutions of the problem

min_{σ★} ‖σ★‖ subject to σ★ ∈ argmin_σ Q(σ).

Notice that the outer optimization only affects the selection of the optimal AF insofar as the inner optimization does not uniquely define σ★, which, as we will later see, it does not. To the best of our knowledge, no prior theoretical work exists on how optimal AFs affect performance guarantees. In Appendix C we review non-theoretical works on the design of AFs, as well as a work studying the RFR model for purposes other than the design of AFs.
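At finite sizes, the test error and sensitivity can be estimated by Monte Carlo; the gradient of the RFR predictor is available in closed form, ∇_x f(x) = (1/√d) Σ_i a_i σ'(⟨θ_i, x⟩/√d) θ_i. The sketch below (assuming NumPy; the ReLU AF, random second-layer weights a, and linear target are illustrative stand-ins for a trained model) estimates both quantities and combines them into the objective Q.

```python
import numpy as np

rng = np.random.default_rng(1)

def sphere(m, d, rng):
    """m i.i.d. samples, uniform on the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((m, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

d, N = 50, 200
Theta = sphere(N, d, rng)
a = rng.standard_normal(N) / N           # illustrative second-layer weights
sigma = lambda t: np.maximum(t, 0.0)     # ReLU, as an example AF
dsigma = lambda t: (t > 0).astype(float) # its (a.e.) derivative

beta = rng.standard_normal(d) / np.sqrt(d)
f_d = lambda X: X @ beta                 # illustrative target function

def predict(X):
    return sigma(X @ Theta.T / np.sqrt(d)) @ a

def grad(X):
    # One row per test point: sum_i a_i sigma'(<theta_i, x>/sqrt(d)) theta_i / sqrt(d)
    G = dsigma(X @ Theta.T / np.sqrt(d)) * a   # m x N
    return G @ Theta / np.sqrt(d)              # m x d

X = sphere(5000, d, rng)
E_hat = np.mean((predict(X) - f_d(X)) ** 2)    # Monte Carlo test error
S_hat = np.mean(np.sum(grad(X) ** 2, axis=1))  # Monte Carlo sensitivity
alpha = 0.5
Q_hat = (1 - alpha) * E_hat + alpha * S_hat    # combined objective
```

In the proportional regime these Monte Carlo estimates concentrate, which is what makes the closed-form asymptotic characterizations reviewed next meaningful for a single large instance.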

2. BACKGROUND ON THE ASYMPTOTIC PROPERTIES OF THE RFR MODEL

Here we will review recently derived closed-form expressions for the asymptotic mean squared error and sensitivity of the RFR model, which are the starting point of our work. First, however, we explain the use-inspired reasons for our setup. Our assumptions are the same as, or very similar to, those of published theoretical papers, e.g., Jacot et al. (2018); Yang et al. (2021); Ghorbani et al. (2021); Mel & Pennington (2022).

