OPTIMAL ACTIVATION FUNCTIONS FOR THE RANDOM FEATURES REGRESSION MODEL

Abstract

The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.

1. INTRODUCTION

For many neural network (NN) architectures, the test error does not monotonically increase as a model's complexity increases but can go down with the training error both at low and high complexity levels. This phenomenon, the double descent curve, defies intuition and has motivated new frameworks to explain it. Explanations have been advanced involving linear regression with random covariates (Belkin et al., 2020; Hastie et al., 2022) , kernel regression (Belkin et al., 2019b; Liang & Rakhlin, 2020) , the neural tangent kernel model (Jacot et al., 2018) , and the Random Features Regression (RFR) model (Mei & Montanari, 2022) . These frameworks allow queries beyond the generalization power of NNs. For example, they have been used to study networks' robustness properties (Hassani & Javanmard, 2022; Tripuraneni et al., 2021) . One aspect within reach and unstudied to this day is finding optimal Activation Functions (AFs) for these models. It is known that AFs affect a network's approximation accuracy and efforts to optimize AFs have been undertaken. Previous work has justified the choice of AFs empirically, e.g., Ramachandran et al. (2017) , or provided numerical procedures to learn AF parameters, sometimes jointly with models' parameters, e.g. Unser (2019) . See Rasamoelina et al. (2020) for commonly used AFs and Appendix C for how AFs have been previously derived. We derive for the first time closed-form optimal AFs such that an explicit objective function involving the asymptotic test error and sensitivity of a model is minimized. Setting aside empirical and principled but numerical methods, all past principled and analytical approaches to design AFs focus on non accuracy related considerations, e.g. Milletarí et al. (2019) . We focus on AFs for the RFR model and expand its understanding. We preview a few surprising conclusions extracted from our main results: 1. The optimal AF can be linear, in which case the RFR model is a linear model. 
For example, if no regularization is used for training, and for low complexity models, a linear AF is often preferred if we want to minimize test error. For high complexity models a non-linear AF is often better; 2. A linear optimal AF can destroy the double descent curve behaviour and achieve small test error with much fewer samples than e.g. a ReLU; 3. When, apart from the test error, the sensitivity of a model becomes important, optimal AFs that without sensitivity considerations were linear can become non-linear, and vice-versa; 4. Using an optimal AF with an arbitrary regularization during training can lead to the same, or better, test error as using a non-optimal AF, e.g. ReLU, and optimal regularization. Published as a conference paper at ICLR 2023 1.1 PROBLEM SET UP We consider the effect of AFs on finding an approximation to a square-integrable function on the -dimensional sphere S -1 ( √ ), the function having been randomly generated. The approximation is to be learnt from training data D = {x , } =1 where x ∈ S -1 ( √ ), the variables {x } =1 are i.i.d. uniformly sampled from S -1 (

√

), and = (x ) + , where the noise variables { } =1 are i.i.d. with E( ) = 0, E( 2 ) = 2 , and E( 4 ) < ∞. The approximation is defined according to the RFR model. The RFR model can be viewed as a two-layer NN with random first-layer weights encoded by a matrix Θ ∈ R × with th row θ ∈ R satisfying θ = , with {θ } i.i.d. uniform on S -1 ( √ ), and with to-be-learnt second-layer weights encoded by a vector a = [ ] =1 = R . Unless specified otherwise, the norm • denotes the Euclidean norm. The RFR model defines a,Θ : S -1 ( √ ) ↦ → R such that a,Θ (x) = =1 ( θ , x / √ ). where (•) is the AF that is the target of our study and , denotes the inner product between vectors and . When clear from the context, we write a,Θ as , omitting the model's parameters. The optimal weights a ★ are learnt using ridge regression with regularization parameter ≥ 0, namely, a ★ = a ★ ( , D) = arg min a∈R 1 =1 - =1 ( θ , x / √ ) 2 + a 2 . (2) We will tackle this question: What is the simplest that leads to the best approximation of ? We quantify the simplicity of an AF with its norm in different functional spaces. Namely, either 1 E(| ( ))|), or 2 E(( ( )) 2 ), where is the derivative of and the expectations are with respected to a normal random variable with zero mean and unit variance, i.e. ∼ N (0, 1). For a comment on these choices please read Appendix A. We quantify the quality with which = a ★ ,Θ approximates via , a linear combination of the mean squared error and the sensitivity of to perturbations in its input. For ∈ [0, 1], x uniform on S -1 ( √ ), we define (1 -)E + S, (5) where E E(( (x) -(x)) 2 ), (6) and S E ∇ x (x) 2 . (7) See Appendix B for a comment on our choice for sensitivity. Like in Mei & Montanari (2022); D'Amour et al. (2020) , we operate in the asymptotic proportional regime where , , → ∞, and have constant ratios between them, namely, / → 1 and / → 2 . 
In this asymptotic setting, it does not matter if in defining ( 6) and ( 7), in addition to taking the expectation with respect to the test data x, independently of D, we also take expectations over D and the random features in RFR. This is because when , , → ∞ with the ratios defined above, E and S will concentrate around their means (Mei & Montanari, 2022; D'Amour et al., 2020) . Mathematically, denoting by either (4) or (3), our goal is to study the solutions of the problem min ★ ★ subject to ★ ∈ arg min ( ). Notice that the outer optimization only affects the selection of optimal AF in so far as the inner optimization does not uniquely define ★ , which, as we will later see, it does not. To the best of our knowledge, no prior theoretical work exists on how optimal AFs affect performance guarantees. We review literature review on non-theoretical works on the design of AFs, and a work studying the RFR model for purposes other than the design of AFs in Appendix C.

2. BACKGROUND ON THE ASYMPTOTIC PROPERTIES OF THE RFR MODEL

Here we will review recently derived closed-form expressions for the asymptotic mean squared error and sensitivity of the RFR model, which are the starting point of our work. First, however, we explain the use-inspired reasons for our setup. Our assumptions are the same as, or very similar to, those of published theoretical papers, e.g. Jacot et al. (2018) ; Yang et al. (2021) ; Ghorbani et al. (2021) ; Mel & Pennington (2022) . 1. Data on a sphere: Normalization of input data is a best practice when learning with NNs (Huang et al., 2020) . Assuming that input data lives on a sphere is one type of normalization. 2. Random features: The seminal work of Rahimi & Recht (2007a) showed the success of using random features on real datasets. For a recent review on their use see Cao et al. (2018) . 3. Asymptotic setting: Mei & Montanari (2022) empirically showed that the convergence to the asymptotic regime is relatively fast, even with just a few hundreds of dimensions. Most real world applications involve larger dimensions , lots of data , and lots of neurons . 4. Shallow architecture: For a finite input dimension , the RFR model can learn arbitrary functions as the number of features grows large (Bach, 2017; Rahimi & Recht, 2007b; Ghorbani et al., 2021) . Existing proof techniques make it very hard yet to extend our type of analysis to more than two layers or complex architectures. A few papers consider models with depth > 2 but do not tackle our problem and have other heavy restrictions on the model, e.g. Pennington et al. (2018) . 5. Regularization: Using regularization during training to control the weights' magnitude is common. It can help convergence speed and generalization error (Goodfellow et al., 2016) . For a review on different types of regularization for learning with NNs see Kukačka et al. (2017) . We make the following assumptions, which we assume hold in the theorems in this section. Assumption 1. 
We assume that the AF is weakly differentiable with weak derivative , it satisfies | ( )|, | ( )|≤ 0 1 | | ∀ ∈ R for some constants 0 < 0 , 1 < ∞, and that it also satisfies 0 = E{ ( )}, 1 = E{ ( )}, 2 = E{ ( ) 2 }, 2 ★ = 2 -2 0 -2 1 , = 1 / ★ , for some 0 , 1 , 2 ∈ R, where the expectations are with respect to ∼ N (0, 1). Assumption 2. We assume that = ( ) and = ( ) such that the following limits exist in (0, ∞): lim →∞ ( )/ = 1 and lim →∞ ( )/ = 2 . Assumption 3. We assume that = (x )+ , where { } ≤ ∼ . . . P are independent of {x } ≤ , with E( 1 ) = 0, E( 2 1 ) = 2 , E( 4 1 ) < ∞, expectations with respect to { }. Furthermore, (x) = ,0 + β ,1 , x + NL (x) , where ,0 ∈ R, β ,1 ∈ R are deterministic with lim →∞ 2 ,0 = 2 0 , lim →∞ β ,1 2 2 = 2 1 > 0. The non-linear NL is a centered Gaussian process indexed by x ∈ S -1 (

√

), with covariance E NL { NL (x 1 ) NL (x 2 )} = Σ ( x 1 , x 2 / ), where (12) Informally, ★ quantifies how non-linear the AF is (cf. Lemma 3.1), 1 quantifies the complexity of the RFR model relative of the dimension , 2 quantifies the amount of data used for training relative to , 2 is the variance of the observation noise, 1 is the magnitude of the linear component of our target function , which is controlled by ,1 , ★ is the magnitude of the non-linear component NL in the target function, and is the ratio between the magnitude of the linear component and the magnitude of all of the sources of randomness in the noisy function + . Recall that all of our results will be derived in the asymptotic regime when → ∞. Σ (•) satisfies E x∼Unif(S -1 ( √ )) {Σ ( 1 / √ )} = 0, E x∼Unif(S -1 ( √ )) {Σ ( 1 / √ ) 1 } = 0, Our contributions are divided into two parts, Section 3.1 and Section 3.2. The theorems' statements in Section 3.2 quickly get prohibitively complex as they are stated more generally, with lots of special cases having to be discussed. Hence, in Section 3.2 we display our analysis on the following three different important regimes: 1 : Ridgeless limit regime, when → 0 + ; 2 : Highly overparameterized limit, when 1 → ∞; 3 : Large sample limit, when 2 → ∞. Section 3.1's results are general and not restricted to these regimes. In the context of the RFR model, these regimes were introduced and discussed in Mei & Montanari (2022) . For what follows we define / 2 ★ . Any "lim →∞ = " should be interpreted as converging to in probability with respect to the training data D, the random features Θ, and the random target as → ∞.

2.1. ASYMPTOTIC MEAN SQUARED TEST ERROR OF THE RFR MODEL

The following theorems are a specialization of a more general theorem, Theorem 12 Mei & Montanari (2022) , which we include in the Appendix G for completeness. Theorem 1 (Theorem 3 Mei & Montanari (2022) ). The asymptotic test error (6) for regime 1 equals E ∞ 1 ≡ lim →0 + lim →∞ E = 2 1 B rless ( , 1 , 2 ) + ( 2 + 2 ★ )V rless ( , 1 , 2 ) + 2 ★ , where B rless ( , 1 , 2 ) ≡ E 1,rless /E 0,rless ,V rless ( , 1 , 2 ) ≡ E 2,rless /E 0,rless , and the functions E 0,rless , E 1,rless and E 2,rless are polynomials that are functions of 2 , 1 , 2 and , where is a function of ≡ min{ 1 , 2 } and 2 . See Appendix D for details. Remark 1. As a function of 1 , E ∞ 1 has a discontinuity at 1 = 2 called the interpolation threshold. For 2 high enough, and for 1 < 2 , E ∞ 1 decreases, reaches a minimum and then explodes approaching 2 . However, past 2 , E ∞ 1 decreases again with 1 . This double descent behavior has been observed/studied in many settings, including Mei & Montanari (2022) and references therein. Theorem 2 (Theorem 4 Mei & Montanari (2022) ). The asymptotic test error (6) for regime 2 equals E ∞ 2 ≡ lim 1 →∞ lim →∞ E = 2 1 B wide ( , 2 , ) + ( 2 + 2 ★ )V wide ( , 2 , ) + 2 ★ , where B wide and V wide are defined in Appendix E Theorem 3 (Theorem 5 Mei & Montanari (2022) ). The asymptotic test error (6) for regime 3 equals E ∞ 3 ≡ lim 2 →∞ lim →∞ E = 2 1 B lsamp ( , 1 , / 2 ★ ) + 2 ★ ( ) where B lsamp ( , 1 , / 2 ★ ) is defined in Appendix F 2.2 ASYMPTOTIC SENSITIVITY OF THE RFR MODEL We derive a sensitivity formula for regimes 1 , 2 , 3 . Our theorems are a specialization (proofs in Appendix M) of the more general Theorem 13 that we include in the Appendix G for completeness. Theorem 4. 
The sensitivity (7) for regime 1 equals S ∞ 1 ≡ lim →0 + lim →∞ S = 2 2 1 D 1,rless ( , 1 , 2 ) ( 2 -1)D 0,rless ( , 1 , 2 ) + ( 2 ★ + 2 )D 2,rless ( , 1 , 2 ) D 0,rless ( , 1 , 2 ) , where D 0,rless ( , 1 , 2 ), D 1,rless ( , 1 , 2 ), and D 2,rless ( , 1 , 2 ) are polynomials found in App. H. Theorem 5. Let 2 equal (32), defined in Appendix E. The sensitivity (7) for regime 2 equals S ∞ 2 ≡ lim 1 →∞ lim →∞ S = 2 2 (( 2 ★ + 2 )(-1 + 2 ) + 2 1 (-1 -2 + 2 (-1 + 2 ))) (-1 + 2 )( 2 -2 2 2 + 2 2 (-1 + 2 )) . ( ) Theorem 6. Let 1 equal (35), defined in Appendix F.. The sensitivity (7) for regime 3 equals S ∞ 3 ≡ lim 2 →∞ lim →∞ S = 2 1 (1 + (2/(-1 + 1 )) + ( 1 /( 1 -2 1 1 + 1 2 (-1 + 1 )))).

2.3. GAUSSIAN EQUIVALENT MODELS

A string of recent work shows that the asymptotic statistics of different models, e.g. their test MSE, is equivalent to that of a Gaussian model. This equivalence is known for the setup in (Mei & Montanari, 2022) , and also for other setups Hu & Lu (2020) ; Ba et al. (2022) ; Loureiro et al. (2021) ; Montanari & Saeed (2022) . Setups differ on the loss they consider, the type of regularization, the random feature matrices used, the training procedure, the asymptotic regime studied, or on the model architecture. In the Gaussian model equivalent to our setup, the AF constants 0 , 1 , and 2 appear as parameters. For example, ★ appears as the magnitude of noise added to the regressor matrix entries, and the non-linear part of the target in the RFR model appears as additive mismatch noise. As such, e.g., tuning ★ is related to an implicit ridge regularization. However, since in the Gaussian equivalent model ★ also appears as an effective model mismatch noise, tuning AFs leads to a richer behaviour than just tuning . Furthermore, tuning the AF requires tuning more than just one parameter, while tuning regularization only one, making our contribution in Sec. 3.2 all the more valuable. In fact, one of our contributions (cf. contribution 4 in Sec. 1) is quantifying the limitation of this connection: tuning AFs can lead to strictly better performance than tuning regularization (cf. Section 3.3). Gaussian equivalent models derive a good portion of their importance from their connection to the original models to which their equivalence is proved, and which are typically closer to real-world use of neural networks. By themselves, these Gaussian models are extremely simplistic and lack basic real-world components, such as the concept of AF that we study here. Hypothesizing an equivalence to a Gaussian models greatly facilitates analytical advances and numerous unproven conjectures have been put forth regarding how generally these equivalences can be established Goldt et al. 
(2022) ; Loureiro et al. (2021) ; Dhifallah & Lu (2020) .

2.4. ADVANTAGES AND LIMITATIONS OF STUDYING THE RFR MODEL

It is known (Mei & Montanari, 2022) that in the asymptotic proportional regime, the RFR cannot learn the non-linear component of certain families of non-linear functions, and in fact cannot do better than linear regression on the input for these functions. Ba et al. (2022) show that a single not-small gradient step to improve the initially random weights of RFR's first layer allows surpassing linear regression's performance in the asymptotic proportional regime. However, for not-small steps, no explicit asymptotic MSE or sensitivity formulas are given that one could use to tune AFs parameters. Also, Ba et al. (2022) , and others, e.g. Hu & Lu (2020) , work with a slightly different class of functions than Mei & Montanari (2022) , e.g. their AFs are odd functions, making comparisons not apples-to-apples. It is known that the RFR can learn non-linear functions in other regimes, e.g. ∼ poly( ), and asymptotic formulas for the RFR in this setting also exist Misiakiewicz (2022) . There is numerical evidence of the real-word usefulness of the RFR (Rahimi & Recht, 2007b) . Linear regression also exhibits a double descent curve in the asymptotic proportional regime (Hastie et al., 2019) . However, under e.g. overparameterization this curve exhibits a minimizer at a finite 2 , while empirical evidence for real networks shows that the error decreases monotonically as 2 → ∞. Therefore, linear regression is not as good as the RFR to explain observed double-descent phenomena. Furthermore, linear regression does not deal with AFs, which is our object of study. Finally, even in a setting where the RFR cannot learn certain non-linear functions with zero MSE, it remains an important question to study how much tuning AF can help improve the MSE and how this affects properties like the double descent curve.

3. MAIN RESULTS

We will find the simplest AFs that lead to the best trade-off between approximation accuracy and sensitivity for the RFR model. Mathematically, we will solve (8). From the theorems in Section 2 we know that E and S, and hence = (1 -)E + S, only depend on the AF via 0 , 1 , 2 . Therefore, we will proceed in two steps. In Section 3.1, we will fix 0 , 1 , 2 , and find with associated values 0 , 1 , 2 that has minimal norm, either (4) or (3). In Section 3.2, we will find values of 0 , 1 , 2 that minimize = (1 -)E + S. Together, these specify optimal AFs for the RFR model. It is the case that properties of the RFR model other than the test error and sensitivity also only depend on the AF via 0 , 1 , 2 . One example is the robustness of the RFR model to disparities between the training and test data distribution (Tripuraneni et al., 2021) . Although we do not focus on these other properties, the results in Section (3.1) can be used to generate optimal AFs for them as well, as long as, similar to in Section 3.2, we can obtain 0 , 1 , 2 that optimize these other properties. We made the decision to, as often as possible, simplify expressions by manipulating them to expose the signal to noise ratio = 2 1 /( 2 + 2 ★ ), 1 > 0, rather than using the variables 1 , , and ★ . The only downside is that conclusions in the regime = ★ = 0 require a bit more of effort to be extracted, often been readable in the limit → ∞. The complete proofs of our main results can be found in Appendix M and their main ideas below. The proofs of Section 3.2 are algebraically heavy and we provide a Mathematica file to symbolically check expressions of theorem statements and proofs in the supplementary material. 3.1 OPTIMAL ACTIVATION FUNCTIONS GIVEN FIXED 0 , 1 , AND 2 Since one of our goals is knowing when an optimal AF is linear we start with the following lemma. Lemma 3.1. The AF is linear (almost surely) if and only if 2 ★ 2 -2 1 -2 0 = 0. We now state results for the norms (4) and (3). 
The problem we will solve under both norms is similar. Let ∼ N (0, 1). We consider solving the following functional problem, where = 1 or 2, min subject to E( ( )) = 0 , E( ( )) = 1 , E( ( ) 2 ) = 2 , with ∼ N (0, 1). ( ) If = 2, we seek solutions over the Gaussian-weighted Lebesgue space of twice weak-differentiable functions that have E(( ( )) 2 ) and E(( ( )) 2 ) defined and finite. If = 1, we seek solutions over the Gaussian-weighted Lebesgue space of weak-differentiable functions that have E(( ( )) 2 ) and E(| ( )|) defined and finite. The derivative is to be understood in a weak sense. Since is a one-dimensional function, the requirement of existence of weak derivative implies that there exists a function that is absolute continuous and that agrees with almost everywhere (Rudin et al., 1976) . Therefore, any specific solution we propose should be understood as an equivalent class of functions that agree with up to a set of measure zero with respect to the Gaussian measure. Published as a conference paper at ICLR 2023 Theorem 7. The minimizers of (19) for = 2, i.e. 2 = E(( ( )) 2 ), are ( ) = 2 + + , where = ± ★ / √ 2, = 1 , and = 0 -. (20) In Theorem 7, if ★ = 0 there is only one minimizer, a linear function. If ★ > 0, there are exactly two minimizers, both quadratic functions. Note that both minimizers satisfy the growth constraints of Assumption 1, and hence can be used within the analysis of the RFR model. We note that quadratic AFs have been empirically studied in the past, e.g. Wuraola & Patel (2018) . Theorem 8. One minimizer of (19) for = 1, i.e. = E(| ( )|), is ( ) = 0 + max{min{ , -}, }, = 1 erf( / √ 2) , erf is the Gauss error function, and ∈ R is the unique solution to the equation 2 2 1 / 2 ★ = ( ) if ★ = 0, and = +∞ if ★ = 0, where is specified in Appendix I. When = E(| ( )|) , we can characterize the complete solution family to (19). 
These are AFs of the form ( ) = + max{ , min{ , }}, where , , , and are chosen such that the constraints in (19) hold. It is possible to explicitly write and as a function of 0 , 1 , , , and express , as the solution of ( , ) = 2 1 / 2 ★ , where (•, •) has explicit form. In this case, for each 0 , 1 , 2 there are an infinite number of optimal AFs since ( , ) = 2 1 / 2 ★ has an infinite number of solutions. ReLU's are included in this family as → ∞. The involved lengthy expressions do not bring any new insights, so we state and prove only Thr. 8, which is a specialization of the general theorem to = -. Proofs' main ideas: We give the main ideas behind the proof of Theorem 7. The proof of Theorem 8 follows similar techniques. The first-order optimality conditions imply that -2 ( ) + 2 ( ) + 1 + 2 + 3 ( ) = 0, where the Lagrange multipliers 1 , 2 , and 3 must be later chosen such that E{ ( )} = 0 , E{ ( )} = 1 , and E{ 2 ( )} = 2 . Using the change of variable ( ) = ˜ ( / √ 2) -1 / 3 -2 /( 3 -1) we obtain -2 ˜ ( ) + ˜ ( ) + 3 ˜ ( ) = 0 which is the Hermite ODE, which is well studied in physics, e.g. it appears in the study of the quantum harmonic oscillator. The cases 3 ∈ {0, 3} require special treatment. Using a finite energy/norm condition we can prove that 3 is quantized. In particular 3 = 4 , = 1, 2, ..., which implies that ( ) = -1 / 3 -2 /( 3 -2) + 2 ( / √ 2) , where is the th Hermite polynomial and a constant. The energy/norm is minimal when = 1, which implies a quadratic AF.

3.2. OPTIMAL ACTIVATION FUNCTION PARAMETERS

We will find AF parameters that minimize a linear combination of sensitivity and test error. We are interested in an asymptotic analytical treatment in the three regimes mentioned in Section 2. To be specific, we will compute U ( 1 , 2 , , , 1 , ★ , ) ≡ arg min 0 , 1 , 2 (1 -)E ∞ + S ∞ , where = 1, 2, or 3. ( ) We are not aware of previous work explicitly studying the trade-off between E and S for the RFR model. For the RFR model, the work of Mei & Montanari (2022) To simplify our exposition, we do not present results for the edge case = 1, for which problem (22) reduces to minimizing the sensitivity. Below we focus on the case when ∈ [0, 1).

Special notation:

In Theorem 9 we use the following special notation. Given two potential choices for AF parameters, say and , we define to mean that exists and that might exist or not, and that = if exists and it leads to a smaller value of (1 -)E + S than using , and otherwise = . Note that and make different statements about the existence of and . This notation is important to interpret the results of Table 1 in Theorem 9. Theorem 9. Let ∈ [0, 1), ≡ min{ 1 , 2 } and ≡ max{ 1 , 2 }. We have that U 1 = ( 0 , 1 , 2 ) : 2 1 (-1 + + ) = 2 ★ ( + ) , where is as in Table 1 . In Table 1 , 1 , 2 , and 3 are the smallest, second smallest and third smallest roots of a 4th degree polynomial ( ), specified in Appendix J, in the range ( , ) (-, min{0, 1 -}), if these exists. The variables 1 , 2 , 3 , , , , the polynomial ( ), and the conditions 1 and 2 are defined in Appendix J.2 when 1 < 2 , and when 1 > 2 these are defined in Appendix J.3. 1 ≤ 2 < < 1 3 < ≤ 2 ≤ 3 ( < ) ∧ 1 1 ( < ) ∧ 2 ∧ ( > ) 1 1 3 1 -- ( < ) ∧ 2 ∧ ( < ) 1 1 3 1 -- ( > ) ∧ 1 ∧ ( > ) -- ( > ) ∧ 1 ∧ ( < ) -- ( > ) ∧ 2 2 2 Table 1 : The optimal AFs (23) depends on according to this table. Cells with "--" never happen. The values of 1 , 2 , 3 , 1 , 2 , 3 , , , , , , and the events 1 and 2 are specified below. Remark 2. Excluding the = 1 scenario, it follows directly from (23) that the optimal AF is linear if and only if = . With this information and Table 1 , we have all the information needed to find exactly when the optimal AF is, or is not, linear. For regime 1 , changing alone can change the optimal AF from linear to non-linear and vice-versa (see e.g. 3rd column of Table 1 ), which justifies the observation 3 in Section 1. Remark 3. For the cases considered in Table 1 , is unique. When ∈ { , , }, or when ( 1 > 2 ) ∧ ( 1 ∈ { , }), ( & are defined in Appendix J. 3), we can lose the uniqueness of . Yet, we can still explicitly characterize the sets of optimal and of optimal AFs parameters. 
For simplicity we omit these cases from Thr. 9. Remark 4. Theorem 9's proof gives relationships among Table 1 's constants that imply that (1) no two rows/columns simultaneously hold and (2) in some cases some cells might not hold. See App. J. We do not consider 1 = 2 in Theorem 9 because it implies (1 -)E ∞ 1 + S ∞ 1 is not defined. Note that 1 = 2 has been called the interpolation threshold in the context of studying the generalization properties of the RFR model under vanishing regularization (Mei & Montanari, 2022) . See Remark 1. When ∈ { , } we can compute the optimal value of the objective explicitly. For example, if 1 < 2 and = the optimal value of the objective is (1-)( 2 ( 2 1 + 2 ★ )+ 1 2 ) 2 -1 . If 1 > 2 and = the optimal value of the objective is 2 1 ( 1 -2 )+ 2 ★ (( -1) 2 -)+ ( 1 -1) 2 -1 2 1 -2 if 1 < 1 and 2 1 ( (2 1 -1)( 1 -2 )+( 1 -1) 2 )+ 2 ★ (( -1) 2 -1 )-1 2 1 -2 if 1 ≥ 1. This follows by substitution. Theorem 10. Let ∈ [0, 1). We have that, U 2 = ( 0 , 1 , 2 ) : 2 1 (-1 + 2 2 -)(-1 + ) = -2( 2 ★ + 2 )(1 + ) , where is the unique solution to ( ) = 0 in the range ∈ (-1, min{1, -1 + 2 2 }), where ( ) 0 + 1 + 2 2 + 3 3 + 4 4 with coefficients described in Appendix K. Remark 5. The only way to get ★ = 0, and hence a linear optimal AF is if simultaneously satisfies 2 1 (-1 + 2 2 -)(-1 + ) = -2 2 (1 + ) and ( ) = 0. Since the first equation does not depend on , but the zeros of ( ) = 0 change continuously with , only very special choices of parameters lead to linear AFs. In general regime 2 does not have optimal linear AFs. Theorem 11. Let ∈ [0, 1). 
We have that U 3 =          {( 0 , 1 , 2 ) : ★ = 0 ∧ 1 = ∞} , if = 0 ∨ ( 1 = 1 ∧ 0 < ≤ 1 4 ) ( 0 , 1 , 2 ) : ★ = 0 ∧ 2 1 = -4 2 +3 + √ 16 2 -8 +1 , if 1 = 1 ∧ > 1 4 ( 0 , 1 , 2 ) : ★ = 0 ∧ 2 1 (-1 + 2 1 -)(-1 + ) + 2 1 (1 + ) = 0 , if 1 = 1 (25) where is the unique solution to ( ) = 0 in the range ∈ (-1, min{1, -1 + 2 2 }), where ( ) is define like in Theorem 10 but with → ∞ and with 2 replaced by 1 . Remark 6. The optimal AF is always linear and independent of the noise variables ★ and . Remark 7. When 1 = 1, ≤ 1 4 there is no optimal AF inside our AF search space since no AF can satisfy 1 = ∞. Rather, there exists a sequence of valid AFs with decreasing whose 1 → ∞. Remark 8. We can compute the optimal objective in closed-form in some scenarios. When = 0 the optimal objective is 2 ★ . When 1 = 1 ∧ 0 < ≤ 1 4 , the optimal objective approaches 2 1 + (1 -) 2 ★ as 1 → ∞. When 1 = 1 ∧ > 1 4 , the optimal objective is 2 1 (4 √ -1 -3 ) + 2 ★ (1 -). Proofs' main ideas: We give the main ideas behind the proof of Theorem 10. The proof of Theorems 9 and 11 follows similar techniques but require more care. The objective only depends on AF parameters via 2 = 2 ( 2 , 2 1 , 2 ★ ). We use the Möbius transformation = (1 + 2 )/( 2 -1) such that the infinite range 2 ∈ [-∞, 0] gets mapped to the finite range ∈ [-1, 1]. We then focus the rest of the proof on minimizing = ( ) over the range of . First we show that given that 2 1 , 2 ★ ≥ 0, 1 > 0, the range of can be reduced to ∈ [ , ] [-1, min{1, -1 + 2 2 }]. Then we compute d /d and d 2 /d 2 , which turn out to be rational functions of . We then show that if ∈ [-1, min{1, -1 + 2 2 }] then d 2 /d 2 > 0, so is strictly convex. We also show that d /d < 0 at and d /d > 0 at , thus and cannot be minimizers. Finally, we show that the zeros of the numerator ( ) of the rational function d /d differ from the denominator's zeros. So the optimal is the unique solution to ( ) in [ , ].

3.3. IMPORTANT OBSERVATIONS

Together, Sections 3.1 and 3.2 explicitly and fully characterize the solutions of (8) in the ridgeless, overparametrized, and large sample regimes. A few important observations follow from our theory. In Appendix L we discuss more on this topic and include details on the observations below. Observation 1: In regime 1 , and if = 0, 1 < 2 , the optimal AF is linear. This follows from Theorem 9 and Remark 2. Indeed, expressions simplify and we get that ( ) = 2 ( 1 -( + 1 ) 2 ) 2 if 1 = 1 or ( ) = -(2 + ) 2 2 if 1 = 1, which implies that 1 does not exist (since it would be outside of (-, )). Hence, the first row of Table 1 always gives = and the optimal AF is linear. Also, when = 0, 1 < 2 , we can explicitly compute the optimal objective (see paragraph before Theorem 10). If furthermore = ★ = 0, we can show that also when 1 > 2 , 1 does not exist and = , therefore the optimal objective and AF when 1 > 2 have the same formula as when 1 < 2 . Hence, if = = ★ = 0, 1 < 2 , from the formula one can conclude that choosing an optimal linear AF destroys the double descent curve if 2 > 1 1 , the test error becoming exactly zero for 1 ≥ 1. This contrasts with choosing a non-linear, sub-optimal, AF which will exhibit a double descent curve. This justifies observation 1 (low complexity 1 < 2 ) and observation 2 in Sec. 1. Fig. 1-(A,B ) illustrates this and details the high-complexity ( 1 > 2 ) observation.  ★ = 0 = 0 = 0 1 = 1 2 = 3 ReLU ★ 1 / 2 0 1 2 3 4 5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 (B) ★ = 0 = 1 = 0 1 = 1 2 = 3 ReLu Linear ★ 1 / 2 -3 -2 -1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 (C) 1 = 1 ReLU 2 = 10 2 = 10 ★ = 0 = 0 ReLU ( ★ ) = ★ 2 = 5 ReLU ( ★ ) > ★ log 10 -4 -3 -2 -1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 (D) ★ = 0 ReLU = 0 1 = 10 2 = 10 2 = 5 ReLU ( ★ ) ★ log 10 Figure 1: (A) Consider the regime 1 . In a noiseless setting, if 2 > 1, the evolution of versus 1 , when an optimal linear AF ★ is used, can achieve 0 test error for 1 ≥ 1. 
However, if a non-linear ReLU is used, we observe the typical double descent curve. (B) Consider the regime 1 . If there is observation noise > 0, the evolution of versus 1 with a linear AF linear is only optimal for 1 < 2 . For 1 > 2 , is optimal for a linear AF until 1 < ( = 5 for the parameters here). For 1 > a non-linear AF ★ , here close to but different from a ReLU, achieves minimal . (C) Consider the regime 2 . When a ReLU is used (green curves), the evolution of versus for both low and high Signal to Noise Ratio (SNR) is only optimal for a special choice of , achieving the minimum ReLU ( ★ ). However, also for the same low and high SNR settings, when an optimal (non-linear) AF is used (orange curves), we obtain the same, or slightly better, ★ regardless of any careful choice for . For low SNR ( 2 = 10) we have ★ = ReLU ( ★ ) = 0.512 and for high SNR ( 2 = 5) we get ReLU ( ★ ) = 0.0220 > ★ = 0.0217. (D) In a situation just like in (C) but with even higher SNR, the difference between the minimum that can be achieved with a particular choice of (blue line ordinate value ReLU ( ★ )) and the value of with any choice of but with an optimal (non-linear) AF (orange line ordinate value ★ ) becomes clearly visible. Both (C) and (D) show that optimally tuning AFs can be different from optimally tuning regularization. Tuning AFs is always better or equal to tuning , showing the limits of the connection between AFs and implicit regularization when Gaussian equivalence holds (cf. Section 2.3). We include inside of each plot the parameters used. See Appendix L.1 for how to reproduce this figure. Observation 2: In regime 2 , looking at Theorem 2 and Theorem 5, one sees that both E and S, and hence the objective (cf. ( 5)), only depend on the optimal AF parameters via 2 . In particular, we can solve ( 22) by searching for the 2 that achieves the smallest objective. 
Given the definition of ω2 = ω2(ψ2, ζ², μ★, λ) in (32), fixing λ and changing μ1 or μ★ always allows one to span a larger range of values for ω2 than fixing the AF's parameters μ1, μ★ and changing λ. In particular, a tedious calculation shows that in the first case the achievable range for ω2 is an interval of the form [ω2_min, 0] that contains the corresponding interval achievable in the second case. This implies that while for a fixed AF one needs to tune λ during learning for best performance, if an optimal AF is used then, regardless of λ, we always achieve either equal or better performance. This justifies observation 4 made in Section 1, and is illustrated in Figure 1-(C,D). In Appendix N we have experiments involving real data that show consistency with these observations. The supplementary material has code to generate Fig. 1 and the figures in Appendix N for real data.

¹If ψ2 < 1 the optimal AF is still linear, but the explosion at the interpolation threshold ψ1 = ψ2 remains.
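These closed-form error curves are easy to evaluate numerically. The sketch below implements the asymptotic ridgeless bias and variance coefficients B = E1/E0 and V = E2/E0 from Theorem 1, using our transcription of the polynomials E_{0,rless}, E_{1,rless}, E_{2,rless} and of χ from Appendix D (the symbol names and the branch of the square root are our reconstruction, not the authors' code):

```python
import numpy as np

def chi(zeta, psi):
    # Negative root of  zeta^2 x^2 + (psi zeta^2 - zeta^2 - 1) x - psi = 0,
    # cf. eq. (28); psi = min(psi1, psi2).
    a = psi * zeta**2 - zeta**2 - 1.0
    return -(np.sqrt(a**2 + 4.0 * psi * zeta**2) + a) / (2.0 * zeta**2)

def bias_variance_rless(zeta, psi1, psi2):
    """Coefficients B, V with E_infty = F1^2 B + (tau^2 + Fstar^2) V + Fstar^2."""
    x = chi(zeta, min(psi1, psi2))
    z2, z4, z6 = zeta**2, zeta**4, zeta**6
    E0 = (-x**5 * z6 + 3 * x**4 * z4 + (psi1 * psi2 - psi2 - psi1 + 1) * x**3 * z6
          - 2 * x**3 * z4 - 3 * x**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * x**2 * z4
          + 2 * x**2 * z2 + x**2 + 3 * psi1 * psi2 * x * z2 - psi1 * psi2)
    E1 = psi2 * x**3 * z4 - psi2 * x**2 * z2 + psi1 * psi2 * x * z2 - psi1 * psi2
    E2 = (x**5 * z6 - 3 * x**4 * z4 + (psi1 - 1) * x**3 * z6 + 2 * x**3 * z4
          + 3 * x**3 * z2 + (-psi1 - 1) * x**2 * z4 - 2 * x**2 * z2 - x**2)
    return E1 / E0, E2 / E0

B, V = bias_variance_rless(zeta=1.0, psi1=0.5, psi2=3.0)  # round values for a check
print(B, V)  # both coefficients should be positive in this underparametrized setting
```

A cheap self-check of the branch choice: with these conventions χ(ζ, ψ) satisfies ζ²χ² + (ψζ² − ζ² − 1)χ − ψ = 0 by construction.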

4. CONCLUSION AND FUTURE WORK

We found optimal Activation Functions (AFs) for the Random Features Regression model (RFR) and characterized when these are linear or non-linear. We connected the best AF to use with the regime in which the RFR operates: e.g., using a linear AF can be optimal even beyond the interpolation threshold, and in some regimes optimal AFs can replace, and outperform, regularization tuning. We reduced the gap between the practice and theory of AF design, but parts of this gap remain to be closed. For example, we could only obtain explicit equations for optimal AFs under two functional norms in the optimization problem from which we extract them. One could explore other norms in the future. One could also explore adding higher-order moment restrictions to the AF, since some of these higher-order constraints appear in the theoretical analysis of neural models (Ghorbani et al., 2021). One open problem is determining, both numerically and analytically, how generic our observations are. One could numerically compute optimal AFs for several models, target functions, and regimes beyond the ones we considered here, and determine how the conditions under which the optimal AF is, or is not, linear compare with the conditions we presented. We suspect that the choice of target function affects our conclusions. In fact, even for our current results, the amount of non-linearity in the target function affects our conclusions. In particular, it can affect whether the optimal AF is linear or non-linear (this is visible in Theorem 9 in its dependency on F★, cf. (12), via the polynomial p(ν)). Another future direction would be to study optimal AFs when the first layer of the model is also trained, even if with just one gradient step. For this model there are asymptotic expressions for the test error (Ba et al., 2022) to which one could apply an analysis similar to the one in this paper.
One could also study the RFR model under different distributions for the random weights, including the Xavier distribution (Glorot & Bengio, 2010) and the Kaiming distribution (He et al., 2015). Some experimental results are included in Appendix O. Regarding the use of different distributions, we note the following. The key technical contribution in Mei & Montanari (2022) is the use of random matrix theory to show that the spectral distribution of the Gramian matrix obtained from the regressor matrix in the ridge regression is asymptotically unchanged if the regressor matrix is recomputed by replacing the element-wise application of an AF with the addition of Gaussian i.i.d. noise. Because many universality results exist in random matrix theory, we expect that for other choices of random weight distributions exactly the same asymptotic results would hold. The first thing to try to prove would be similar results for weights drawn from well-known random matrix ensembles. We note that Gerace et al. (2020) provide very strong numerical evidence that this is true for other matrix ensembles, and stronger results are known in what concerns spectral equivalence alone (Benigni & Péché, 2021). One could also study both the RFR and other models in regimes other than the asymptotic proportional regime. The work of Misiakiewicz (2022) is a good lead, since it provides asymptotic error formulas derived for our setup but in a polynomial scaling regime of the dimensions. In this regime, the RFR can learn non-linear target functions with zero MSE (Misiakiewicz, 2022; Ghorbani et al., 2021). These formulas are equivalent to the ones in Mei & Montanari (2022) after a renormalization of parameters, a reparametrization that depends on a problem's constants, as noted in Lu & Yau (2022). It is unclear if this reparametrization would change the high-level observations of our work, but we expect it to change their low-level details, such as the thresholds in Table 1.
It would be interesting to carry out these same investigations for more realistic neural architectures, such as those in Belkin et al. (2019a) and Nakkiran et al. (2019), for which phenomena such as the double descent curve are well documented. Finally, it would be interesting to design AFs for an RFR model that optimize a combination of test error and robustness to test/train distribution shifts and adversarial attacks. The starting point would be Tripuraneni et al. (2021).

A COMMENT ON THE CHOICE OF METRICS

A few reasons for choosing norms (3) and (4) in our setup are the following. • We use the L1 and L2 norms because these are two of the most widely used functional norms. • We use these norms on a Gaussian-weighted space because the dependency of performance on the activation functions (AFs) in the prior work on which we build involves Gaussian-weighted measures. To be specific, both the sensitivity and the error discussed in Section 2.1 and Section 2.2 depend on the AF only via μ0, μ1, and μ2, defined in (9), and these in turn are defined using a Gaussian distribution. In other words, Gaussian-weighted spaces are the natural spaces in our high-dimensional setting. • We focus on the derivative of the AF to impose the notion that the AF cannot have sudden changes, i.e., it needs to be simple. We are aware that other choices of functional norms are possible, e.g., taking higher-order derivatives and/or higher-order moments of the AFs, and we plan to investigate them in future work, as mentioned in Section 4.
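The Gaussian-weighted quantities μ0 = E(σ(G)), μ1 = E(Gσ(G)), μ2 = E(σ(G)²), and μ★² = μ2 − μ0² − μ1² from (9) are easy to evaluate numerically. The sketch below (our illustration, not the paper's code) checks them for the ReLU against their closed forms:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 200001)            # quadrature grid, step 1e-4
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)      # standard Gaussian density

def integ(f):
    # trapezoid rule on the uniform grid
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))

def gauss_moments(sigma):
    s = sigma(z)
    mu0 = integ(s * p)
    mu1 = integ(z * s * p)
    mu2 = integ(s**2 * p)
    return mu0, mu1, mu2, mu2 - mu0**2 - mu1**2

mu0, mu1, mu2, mustar2 = gauss_moments(lambda u: np.maximum(u, 0.0))  # ReLU
# Closed forms for ReLU: mu0 = 1/sqrt(2 pi), mu1 = 1/2, mu2 = 1/2,
# hence mustar^2 = 1/4 - 1/(2 pi).
print(mu0, mu1, mu2, mustar2)
```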

B COMMENT ON THE CHOICES OF SENSITIVITY

This appendix pertains to the choice of definition of sensitivity in (7). We want to relate E(‖∇_x f_{a★,Θ}(x)‖²) with ‖E(∇_x f_{a★,Θ}(x))‖². The expectations can be taken just with respect to the test data x, since in the asymptotic regime these quantities concentrate around their expected values with respect to the other random variables. The gradient ∇_x f_{a★,Θ}(x) equals Θᵀ D a★ / √d, where D = diag(σ'(Θx/√d)) and, for a vector v, diag(v) is a diagonal matrix with diagonal equal to v. We can write
E(‖∇_x f_{a★,Θ}(x)‖²) = E(‖Θᵀ D a★ / √d‖²) = Σ_{i,j,k} a★_i a★_j E(D_ii D_jj) Θ_ik Θ_jk / d. (26)
Following D'Amour et al. (2020), Appendix E.5, when d → ∞ we use the fact that θ_iᵀ x / √d → G, where G ∼ N(0,1), and hence that E(D_ii) = E(σ'(θ_iᵀ x / √d)) = E(σ'(G)) + o(1) = μ1 + o(1), to compute E(D_ii D_jj). We need to consider two scenarios. If i ≠ j, then E(D_ii D_jj) = E(D_ii)E(D_jj) = μ1². If i = j, then E(D_ii²) = E((σ'(G))²). Let us define μ3 = E((σ'(G))²). Replacing these formulas for E(D_ii D_jj) in (26), we get
E(‖∇_x f_{a★,Θ}(x)‖²) = μ1² ‖a★ᵀΘ‖² / d + (μ3 − μ1²) Σ_{i,k} (a★_i)² Θ²_ik / d. (27)
The first term of the r.h.s. of (27) is exactly eq. (19) in D'Amour et al. (2020), for which we are given an asymptotic expression and which we use as the basis of Theorems 4-6, which are themselves a specialization of Theorem 13. The second term is non-negative since, by Jensen's inequality, μ3 = E((σ'(G))²) ≥ (E(σ'(G)))² = μ1². Therefore, E(‖∇_x f_{a★,Θ}(x)‖²) ≥ ‖E(∇_x f_{a★,Θ}(x))‖². The results we present, especially those in Section 3.2, are already extremely complex to state and to interpret, even with us essentially optimizing only two parameters, μ1 and μ2 (μ0 can be assumed 0 without loss of generality). Stating results on optimizing an extra parameter μ3 would make our exposition even more complex, with many more special cases. Only in the specific case where we choose the objective (4) for the outer optimization problem (8) is it clear that μ3 is determined by μ0, μ1, and μ2, and hence that there is no added complexity in the number of parameters to optimize.
Furthermore, we would still need to find an asymptotic formula for the second term in (27). At the same time, while the first term in (27) can be expressed as the trace of a product of random matrices, making it easy to use random matrix theory to obtain asymptotic formulas for it, the second term does not easily yield to a similar type of analysis.
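The gradient expression above, ∇_x f_{a,Θ}(x) = Θᵀ diag(σ'(Θx/√d)) a / √d (in our reconstructed notation), and the Jensen inequality μ3 ≥ μ1² can both be sanity-checked on a small instance; this is our illustration, under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 6, 9
Theta = rng.normal(size=(N, d))
a = rng.normal(size=N)
x = rng.normal(size=d)

sigma = np.tanh
dsigma = lambda u: 1.0 - np.tanh(u)**2   # sigma'

def f(v):
    return a @ sigma(Theta @ v / np.sqrt(d))

# Analytic gradient: Theta^T diag(sigma'(Theta x / sqrt(d))) a / sqrt(d)
grad = Theta.T @ (dsigma(Theta @ x / np.sqrt(d)) * a) / np.sqrt(d)

# Central finite differences
h = 1e-6
num = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(d)])
err = float(np.max(np.abs(grad - num)))
print(err)

# Jensen: mu3 = E[(sigma'(G))^2] >= (E[sigma'(G)])^2 = mu1^2
# (for a sample, this is just "sample variance >= 0")
g = dsigma(rng.normal(size=100_000))
print((g**2).mean() - g.mean()**2)
```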

C MORE ON RELATED WORK

First attempts to optimize AFs include Poli (1996), Weingaertner et al. (2002), and Khan et al. (2013), where genetic and evolutionary algorithms were used to learn how to numerically combine different AFs from a library into the same network. More recently, Ramachandran et al. (2017) used reinforcement learning to empirically discover AFs that maximize test accuracy. Their search was done over AFs that were a combination of basic units. This work produced the Swish AF. Similarly, Goyal et al. (2020) defined AFs as the weighted sum of a pre-defined basis and searched for optimal weights via training. Unser (2019) provided a theoretical foundation to simultaneously learn a NN's weights and continuous piecewise-linear AFs. They showed that learning in their framework is compatible with learning in currently existing deep-ReLU, parametric-ReLU, APL (adaptive piecewise-linear) and MaxOut architectures. Tavakoli et al. (2021) parameterized continuous piecewise-linear AFs and numerically learnt their parameters to improve both accuracy and robustness to adversarial perturbations. They numerically compared the performance of their SPLASH framework with that of using ReLUs, leaky-ReLUs (Maas et al., 2013), PReLUs (He et al., 2015), tanh units, sigmoid units, ELUs (Clevert et al., 2015), maxout units (Goodfellow et al., 2013), Swish units, and APL units (Agostinelli et al., 2014). Similarly, Zhou et al. (2021) parameterized AFs as piecewise-linear units and learnt the AFs' parameters to optimize different tasks. Banerjee et al. (2019) proposed an empirical method to learn variations of ReLUs. Bubeck et al. (2020) studied 2-layer NNs and gave a condition on the Lipschitz constant of a polynomial AF for the network to perfectly fit data. They related this condition to the model's parameter size and robustness, and numerically related the number of ReLUs in the model to its robustness.
Several papers proposed new AFs and empirically studied their performance without systematically tuning them. Milletarí et al. (2019) identified ReLU and Swish as naturally arising components of a statistical mechanics model. Rozsa & Boult (2019) introduced a "tent"-shaped AF that improves robustness without adversarial training, while not hurting the accuracy on non-adversarial examples. Zhou et al. (2020) proposed an AF called SRS that can overcome the non-zero mean, negative missing, and unbounded output problems of ReLUs. Their work was purely empirical. Wuraola & Patel (2018) developed the SQuared Natural Law AF. Nicolae (2018) proposed the Piece-wise Linear Unit AF. The RFR model was introduced by Rahimi & Recht (2007a) as a way to project input data into a low-dimensional random features space, and it has since been studied considerably. A great part of the literature has drawn connections between the expressive power of NNs and that of the RFR model, often via the study of Gaussian processes; an example is the work of Williams (1996) on computing with infinite networks. In addition to Mei & Montanari (2022) and D'Amour et al. (2020), already discussed, other papers have studied the approximation properties of the RFR model. Ghorbani et al. (2021) studied both the RFR model and the neural tangent kernel model and provided conditions under which these models can fit polynomials in the raw features up to a maximum degree. These conditions were provided under two regimes, in which some of the dimensions of the problem grow to infinity while the others remain large but finite. Their results hold under weak assumptions on the AFs. Tripuraneni et al. (2021) used the RFR model to compute how robust the test error is to distribution shifts between training and test data. This was done in a high-dimensional asymptotic limit where random features and training data are normally distributed. The derivations hold for a generic AF that satisfies mild assumptions similar to the assumptions in this paper.
Hassani & Javanmard (2022) characterized the role of overparametrization in the adversarial robustness of the RFR model under an asymptotic regime, when learning a linear function with normally distributed random weights and samples. Their AF was a shifted ReLU. Finally, a few papers have studied the behavior of models similar to the RFR but within a different context. For example, Taheri et al. (2021) and Bean et al. (2013) seek to compute the optimal loss function under similar asymptotic regimes with large data sets.

D DETAILS REGARDING THEOREM 1

The definitions of the functions E_{0,rless}, E_{1,rless}, E_{2,rless} and χ are as follows:

E_{0,rless}(ζ, ψ1, ψ2) ≡ −χ⁵ζ⁶ + 3χ⁴ζ⁴ + (ψ1ψ2 − ψ2 − ψ1 + 1)χ³ζ⁶ − 2χ³ζ⁴ − 3χ³ζ² + (ψ1 + ψ2 − 3ψ1ψ2 + 1)χ²ζ⁴ + 2χ²ζ² + χ² + 3ψ1ψ2χζ² − ψ1ψ2,

E_{1,rless}(ζ, ψ1, ψ2) ≡ ψ2χ³ζ⁴ − ψ2χ²ζ² + ψ1ψ2χζ² − ψ1ψ2,

E_{2,rless}(ζ, ψ1, ψ2) ≡ χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ1 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (−ψ1 − 1)χ²ζ⁴ − 2χ²ζ² − χ²,

where

χ = χ(ζ, ψ) ≡ −( √((ψζ² − ζ² − 1)² + 4ψζ²) + ψζ² − ζ² − 1 ) / (2ζ²), with ψ ≡ min{ψ1, ψ2}. (28)

Note that all these functions depend on ζ only through ζ². Note also that E_{0,rless}, E_{1,rless}, E_{2,rless} are polynomials in their respective variables.

E DETAILS REGARDING THEOREM 2

B_wide(ζ, ψ2, λ̄) ≡ (ψ2ω2 − ψ2) / ((ψ2 − 1)ω2³ + (1 − 3ψ2)ω2² + 3ψ2ω2 − ψ2),
V_wide(ζ, ψ2, λ̄) ≡ (ω2³ − ω2²) / ((ψ2 − 1)ω2³ + (1 − 3ψ2)ω2² + 3ψ2ω2 − ψ2), (31)

where

ω2 ≡ −( √((ψ2ζ² − ζ² − λ̄ − 1)² + 4ψ2ζ²(λ̄ + 1)) + ψ2ζ² − ζ² − λ̄ − 1 ) / (2(λ̄ + 1)). (32)

F DETAILS REGARDING THEOREM 3

B_lsamp(ζ, ψ1, λ̄) ≡ ( (ω1³ − ω1²)/ζ² + ψ1ω1 − ψ1 ) / ((ψ1 − 1)ω1³ + (1 − 3ψ1)ω1² + 3ψ1ω1 − ψ1), (34)

where

ω1 ≡ −( √((ψ1ζ² − ζ² − λ̄ − 1)² + 4ψ1ζ²(λ̄ + 1)) + ψ1ζ² − ζ² − λ̄ − 1 ) / (2(λ̄ + 1)). (35)

G GENERAL THEOREMS FOR THE ASYMPTOTIC MEAN SQUARED TEST ERROR AND SENSITIVITY OF THE RFR MODEL

This appendix is referenced at the start of Section 2.1 and at the start of Section 2.2.

Theorem 12 (Theorem 2 of Mei & Montanari (2022)). If assumptions 1, 2, 3 hold, then for any value of the regularization parameter λ > 0 the asymptotic test error (6) of the RFR satisfies

E →p E∞ ≡ F1² B(ζ, ψ1, ψ2, λ/μ★²) + (τ² + F★²) V(ζ, ψ1, ψ2, λ/μ★²) + F★²,

where →p denotes convergence in probability when d → ∞, with respect to the training data D, the random features Θ, and the random target function, and where B(ζ, ψ1, ψ2, λ̄) ≡ E1(ζ, ψ1, ψ2, λ̄)/E0(ζ, ψ1, ψ2, λ̄), V(ζ, ψ1, ψ2, λ̄) ≡ E2(ζ, ψ1, ψ2, λ̄)/E0(ζ, ψ1, ψ2, λ̄), and the functions E0, E1, E2 are defined as follows:

E0(ζ, ψ1, ψ2, λ̄) ≡ −χ⁵ζ⁶ + 3χ⁴ζ⁴ + (ψ1ψ2 − ψ2 − ψ1 + 1)χ³ζ⁶ − 2χ³ζ⁴ − 3χ³ζ² + (ψ1 + ψ2 − 3ψ1ψ2 + 1)χ²ζ⁴ + 2χ²ζ² + χ² + 3ψ1ψ2χζ² − ψ1ψ2,
E1(ζ, ψ1, ψ2, λ̄) ≡ ψ2χ³ζ⁴ − ψ2χ²ζ² + ψ1ψ2χζ² − ψ1ψ2,
E2(ζ, ψ1, ψ2, λ̄) ≡ χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ1 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (−ψ1 − 1)χ²ζ⁴ − 2χ²ζ² − χ²,

where χ(ψ1, ψ2, λ̄) ≡ ν1(i(ψ1ψ2)^{1/2}) · ν2(i(ψ1ψ2)^{1/2}), and ν1 and ν2 are two functions specified in Def. 1 of Mei & Montanari (2022).

Theorem 13 (Eq. (19) of D'Amour et al. (2020)). If assumptions 1, 2 and 3 hold, then for any value of the regularization parameter λ̄ ≡ λ/μ★² > 0 the asymptotic sensitivity of the RFR, namely equation (7) with f = f_{a,Θ}, where f_{a,Θ} is defined as in equations (1) and (2), satisfies

S →p S∞ ≡ (F1²μ1²/μ★²) · D1(ζ, ψ1, ψ2, λ̄) / ((ζ² − 1)D0(ζ, ψ1, ψ2, λ̄)) + ((τ² + F★²)/μ★²) · D2(ζ, ψ1, ψ2, λ̄) / D0(ζ, ψ1, ψ2, λ̄),

where →p denotes convergence in probability when d → ∞, with respect to the training data X, ε, the random features Θ, and the random target function, and where

D0(ζ, ψ1, ψ2, λ̄) = χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ1 + ψ2 − ψ1ψ2 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (3ψ1ψ2 − ψ2 − ψ1 − 1)χ²ζ⁴ − 2χ²ζ² − χ² − 3ψ1ψ2χζ² + ψ1ψ2, (38)
D1(ζ, ψ1, ψ2, λ̄) = χ⁶ζ⁶ − 2χ⁵ζ⁴ − (ψ1ψ2 − ψ1 − ψ2 + 1)χ⁴ζ⁶ + χ⁴ζ⁴ + χ⁴ζ² − 2(1 − ψ1ψ2)χ³ζ⁴ − (ψ1 + ψ2 + ψ1ψ2 + 1)χ²ζ² − χ², (39)
D2(ζ, ψ1, ψ2, λ̄) = −(ψ1 − 1)χ³ζ⁴ − χ³ζ² + (ψ1 + 1)χ²ζ² + χ², (40)

and where χ is defined in (28). Remark 9. Eq. (19) in D'Amour et al.
(2020) is expressed in terms of the asymptotic limit of ‖a★ᵀΘ‖². See Appendix B for the connection between this representation and equation (7) in our definition of S.

H DETAILS REGARDING THEOREM 4

D_{0,rless}(ζ, ψ1, ψ2) = χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ1 + ψ2 − ψ1ψ2 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (3ψ1ψ2 − ψ2 − ψ1 − 1)χ²ζ⁴ − 2χ²ζ² − χ² − 3ψ1ψ2χζ² + ψ1ψ2, (41)
D_{1,rless}(ζ, ψ1, ψ2) = χ⁶ζ⁶ − 2χ⁵ζ⁴ − (ψ1ψ2 − ψ1 − ψ2 + 1)χ⁴ζ⁶ + χ⁴ζ⁴ + χ⁴ζ² − 2(1 − ψ1ψ2)χ³ζ⁴ − (ψ1 + ψ2 + ψ1ψ2 + 1)χ²ζ² − χ², (42)
D_{2,rless}(ζ, ψ1, ψ2) = −(ψ1 − 1)χ³ζ⁴ − χ³ζ² + (ψ1 + 1)χ²ζ² + χ², (43)

where χ is defined as in (28), and ψ ≡ min{ψ1, ψ2}.

I DETAILS REGARDING THEOREM 8

The formula for g(t) is as follows:

g(t) = ( e^{t²/2} erf(t/√2)² ) / ( e^{t²/2}( t²(1 − erf(t/√2)) + erf(t/√2) ) − √(2/π) t ).

J DETAILS REGARDING THEOREM 9

J.1 RELATIONSHIP AMONG THE CONSTANTS IN TABLE 1

This appendix is referenced in Remark 4. The following relationships can be derived either from the proof of Theorem 9, or from a direct computation based on the formulas given in Theorem 9. They do not add critical information to what is proved in Theorem 9 but they give general rules to exclude some cells in Table 1 . Remark 10. Let 1 < 2 . If 2 < min{2 1 , 1 + 1}, then < < . In this case, the 4th and 5th rows of Table 1  never happen. If 2 > min{2 1 , 1 + 1}, then < < . If 2 = min{2 1 , 1 + 1}, then = = . We always have that , , ∈ [0, 1], and 0 ≤ 3 < 2 < 1 . Since 1 and 2 cannot simultaneously hold, no two rows can simultaneously hold. Since 3 < 2 < 1 , no two columns can simultaneously hold. Remark 10 follows from Lemma M.13 used in the proof of Theorem 9.

K DETAILS REGARDING THEOREM 10

The coefficients of ( ) are given below 0 = 8 2 -1 +( +4 2 (-1+2 2 )(-1+2 )), ( ) 1 = 8 2 -1 +4((1-4 2 ) +2 2 2 ), ( ) 2 = -2(-3 + 2 (2 + 4 )), ( ) 3 = 4 , and 4 = . (80)

L IMPORTANT OBSERVATIONS

This appendix is referenced in Section 3.3.

L.1 REPRODUCING FIGURE 1

The plots in Figure 1 come directly from our theory. They involve no experiments and can be obtained using many standard mathematical computing software tools. Nonetheless, since it does require some effort to code our equations, we include code to generate Figure 1 at the following GitHub link: https://github.com/Jeffwang87/RFR_AF. This code is also available in the supplementary zip file provided. To generate the plots in Figure 1, run the file named RunMeToGenerateFigure_1.nb. It runs using Wolfram Mathematica V12. We ran it on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of 2667 MHz DDR4 memory. On this machine it takes about 1 second to run.

L.2 SOME DETAILS ABOUT OBSERVATION 1 IN SECTION 3.3

When the weight on the sensitivity term in the objective is zero and ψ1 < ψ2, using Theorem 9 we can explicitly compute the objective, which reduces to the test error

E_{R1} = ( F★²ψ2 + F1² max{1 − ψ1, 0} ψ2 + ψ1τ² ) / (ψ2 − ψ1).

This follows from setting the sensitivity weight to zero in the expressions in the paragraph before Theorem 10. Observe that, as a function of ψ1, the test error E_{R1} achieves a minimum of (F★²ψ2 + τ²)/(ψ2 − 1) at ψ1 = 1 if ψ2 > 1 + 1/ρ, and a minimum of (F1²ψ2 + F★²ψ2)/ψ2 at ψ1 = 0 otherwise. In particular, if τ = F★ = 0+, then the condition ψ2 > 1 + 1/ρ becomes ψ2 > 1. When the sensitivity weight is zero and ψ1 > ψ2, one can show that, in the range ν ∈ (ν_min, ν_max) = (−ψ2, min{0, ψ1 − ψ2}), the polynomial p(ν) that determines ν1 satisfies p(ν) < 0, and hence has no roots there. Therefore ν1 does not exist. Therefore, if furthermore we have τ = F★ = 0, then condition 2 is never true, so the optimum must be read from the first row of Table 1, which implies that the optimal AF is linear also when ψ1 > ψ2.
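Assuming the reconstruction E_{R1}(ψ1) = (F★²ψ2 + F1² max{1 − ψ1, 0}ψ2 + ψ1τ²)/(ψ2 − ψ1) (our transcription of the garbled formula), the switch of the minimizer between ψ1 = 1 and ψ1 = 0 at the threshold ψ2 = 1 + 1/ρ, with ρ = F1²/(τ² + F★²), can be verified on a grid:

```python
import numpy as np

def E_R1(psi1, psi2, F1=1.0, Fstar=0.0, tau=0.0):
    # Noiseless/noisy ridgeless test error with the sensitivity weight set to zero
    return (Fstar**2 * psi2 + F1**2 * max(1.0 - psi1, 0.0) * psi2
            + psi1 * tau**2) / (psi2 - psi1)

def argmin_psi1(psi2, **kw):
    grid = np.linspace(0.0, psi2 - 0.5, int((psi2 - 0.5) * 100) + 1)  # step 0.01
    vals = np.array([E_R1(p1, psi2, **kw) for p1 in grid])
    return float(grid[np.argmin(vals)]), float(np.min(vals))

# psi2 above the threshold 1 + 1/rho = 1.09: minimum at psi1 = 1
p_hi, e_hi = argmin_psi1(3.0, Fstar=0.3)
# psi2 below the threshold 1 + 1/rho = 2: minimum at psi1 = 0
p_lo, e_lo = argmin_psi1(1.5, Fstar=1.0)
print(p_hi, e_hi, p_lo, e_lo)
```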

M PROOFS

This appendix is referenced in Section 3.

M.1 PROOF OF SENSITIVITY PROPERTIES OF SECTION 2.2

Proof of Theorem 4. The proof follows directly from Theorem 13 by taking the limit λ̄ → 0+. Proof of Theorem 5. The proof follows directly from Theorem 13 by taking the limit ψ1 → +∞. Proof of Theorem 6. The proof follows directly from Theorem 13 by taking the limit ψ2 → +∞.

M.2 PROOF OF NECESSARY AND SUFFICIENT CONDITION FOR LINEARITY OF SECTION 3.1

Proof of Lemma 3.1. If σ(z) is a linear function, a direct calculation shows that μ★ = 0. Conversely, since 0 ≤ E((σ(G) − μ0 − μ1G)²) = μ★², we have that μ★ = 0 implies σ(G) = μ0 + μ1G almost surely. Since the support of the probability density function of G is ℝ, it follows that σ(z) = μ0 + μ1z except on a set of measure zero in ℝ.
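The next subsection repeatedly uses Gaussian integration-by-parts (Stein-type) identities, in our reconstructed notation E(σ''(G)) = E(Gσ'(G)), E(Gσ''(G)) = E(G²σ'(G)) − E(σ'(G)), and E(σ''(G)σ(G)) = E(Gσ'(G)σ(G)) − E((σ'(G))²). They can be spot-checked numerically for a smooth AF such as tanh:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def E(f):
    # Gaussian expectation by trapezoid quadrature
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))

s = np.tanh(z)
s1 = 1.0 - s**2          # sigma'
s2 = -2.0 * s * s1       # sigma''

r1 = E(s2 * p) - E(z * s1 * p)
r2 = E(z * s2 * p) - (E(z**2 * s1 * p) - E(s1 * p))
r3 = E(s2 * s * p) - (E(z * s1 * s * p) - E(s1**2 * p))
print(r1, r2, r3)  # all three residuals should be ~0
```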

M.3 PROOF OF THEOREM 7

To prove Theorem 7, we first need to state and prove a series of intermediary results.

Lemma M.1. A necessary condition for optimality of σ(z) is that
−2zσ'(z) + 2σ''(z) + β1 + β2z + β3σ(z) = 0, (82)
where β1, β2, and β3 satisfy the following constraints,
β1 + β3μ0 = 0, (83)
−2μ1 + β2 + β3μ1 = 0, (84)
−2E((σ'(G))²) + β1μ0 + β2μ1 + β3μ2 = 0, (85)
and that
lim_{z→+∞} (σ'(z))² e^{−z²/2} = lim_{z→−∞} (σ'(z))² e^{−z²/2} = 0. (86)

Remark 12. Note that since (82) is a second order ODE, its solutions are parametrized by two constants in addition to being parametrized by β1, β2, β3. These five constants are set by our three constraints (19) together with two boundary conditions from our variational problem, which are lim_{z→±∞} σ'(z)e^{−z²/2} = 0. These last two we replace (see proof) by (86).

Remark 13. The lemma implies that knowing μ0, μ1, μ2 and the objective value E((σ'(G))²) is enough to determine β1, β2, β3, even without solving the ODE.

Proof of Lemma M.1. Derivation of equation (82): A Lagrangian for (19) is E((σ'(G))² + β1(μ0 − σ(G)) + β2(μ1 − Gσ(G)) + β3((1/2)(μ2 − (σ(G))²))), which can be written as ∫_{−∞}^{∞} L(z, σ, σ') dz where, if we define p(z) = (1/√(2π)) e^{−z²/2}, the Lagrangian density is L(z, σ, σ') = p(z)((σ')² + β1(μ0 − σ) + β2(μ1 − zσ) + β3(1/2)(μ2 − σ²)). We use the Euler-Lagrange equation with free boundary conditions (Gelfand et al., 2000) to get the necessary condition (82). To do so, we compute in sequence
∂L/∂σ = −p(z)(β1 + β2z + β3σ), (89)
∂L/∂σ' = p(z)(2σ'), (90)
(d/dz)(∂L/∂σ') = p'(z)(2σ') + p(z)(2σ'') = p(z)(−2zσ' + 2σ''). (91)
These lead to
0 = (d/dz)(∂L/∂σ') − ∂L/∂σ = p(z)(−2zσ' + 2σ'' + β1 + β2z + β3σ), and (92)
0 = lim_{z→±∞} ∂L/∂σ' = lim_{z→±∞} p(z)σ'(z), (93)
where the last condition follows from the fact that there is no boundary condition on σ. Since p(z) > 0, equation (92) implies (82). Derivation of equation (86): Since we are working with necessary conditions, we can choose not to list (93) in our lemma. Rather, we include condition (86), which is a consequence of the fact that E((σ'(G))²) must be finite. Indeed, E((σ'(G))²) = ∫_{−∞}^{∞} p(z)(σ'(z))² dz < ∞ implies that p(z)(σ'(z))² must vanish at ±∞.
Although not necessary for this proof, note that, since p(z) goes to zero as z → ±∞, equations (86) imply equations (93).

Derivation of equations (83)-(85):

Equation (82) must hold for all z. Hence, we can replace z in it by a standard normal random variable G and compute the expected value of both of its sides. This leads to
−2E(Gσ'(G)) + 2E(σ''(G)) + β1 + β3μ0 = 0. (94)
We can also multiply (82) by z, replace z by a standard normal random variable, and compute the expected value of both sides. This leads to
−2E(G²σ'(G)) + 2E(Gσ''(G)) + β2 + β3μ1 = 0. (95)
Finally, we multiply (82) by σ(z), replace z by a standard normal random variable, and compute the expected value of both sides. This leads to
−2E(Gσ'(G)σ(G)) + 2E(σ''(G)σ(G)) + β1μ0 + β2μ1 + β3μ2 = 0. (96)
Using integration by parts, and the fact that for p(z) = (1/√(2π)) e^{−z²/2} we have p'(z) = −zp(z) and p''(z) = (z² − 1)p(z), we derive the following useful relationships:
E(σ''(G)) = E(Gσ'(G)), (97)
E(Gσ''(G)) = E(G²σ'(G)) − E(σ'(G)), (98)
E(σ'(G)) = μ1, (99)
E(σ''(G)σ(G)) = E(Gσ'(G)σ(G)) − E((σ'(G))²). (100)
To derive these relationships via integration by parts we made use of the following relationships,
lim_{z→±∞} p(z)σ'(z)σ(z) = lim_{z→±∞} p(z)σ'(z) = lim_{z→±∞} zp(z)σ'(z) = lim_{z→±∞} zp(z)σ(z) = 0,
which can be proved from E((σ(G))²), E((σ'(G))²) < ∞. Also, from E((σ(G))²), E((σ'(G))²) < ∞ and μ1 ∈ ℝ, we can show that each of the expected values on the right-hand side of (97)-(100) is well-defined. Hence, the left-hand side of (97)-(100) is well defined. To finish the proof we replace (97) in (94) to get β1 + β3μ0 = 0, which is equation (83). Then we replace (98) and (99) in (95) to get −2μ1 + β2 + β3μ1 = 0, which is equation (84). Lastly, we replace (100) in (96) to get −2E((σ'(G))²) + β1μ0 + β2μ1 + β3μ2 = 0, which is equation (85). Lemma M.2.
The solutions of (82) are of the form
σ(z) = σ̄(z) + c1 H_{β3/2}(z/√2) + c2 · ₁F₁({−β3/4}, {1/2}; z²/2), (102)
where
σ̄(z) = (β2/2)z − (β1/4)z² · ₂F₂({1,1}, {3/2, 2}; z²/2), if β3 = 0, (103)
σ̄(z) = −β1/2 − (β2/2)z + (β2/2)√(π/2) e^{z²/2} erf(z/√2) − (β2/4)z³ · ₂F₂({1,1}, {3/2, 2}; z²/2), if β3 = 2, (104)
σ̄(z) = −β1/β3 − (β2/(β3 − 2))z, if β3 ∉ {0, 2}, (105)
where erf is the Gauss error function, ₚF_q({a}, {b}; ·) is the generalized hypergeometric function with parameters {a}, {b}, and H_ν is the Hermite polynomial, extended to a possibly non-integer ν.

Remark 14. By an extension of H_ν(x) to a non-integer ν we mean H_ν(x) ∝ x · ₁F₁({−(ν − 1)/2}, {3/2}; x²), which is defined for non-integer ν.

Proof of Lemma M.2. Since (82) is a second order ODE, its solutions are spanned by any particular solution σ̄(z) plus a linear combination of two solutions of the associated homogeneous ODE. The homogeneous ODE associated with (82) is −2zσ'(z) + 2σ''(z) + β3σ(z) = 0. The change of variable σ(z) = σ̃(z/√2) allows us to get σ̃''(u) − 2uσ̃'(u) + β3σ̃(u) = 0, which is a well-known ODE called the Hermite differential equation. This implies that the homogeneous solutions of our ODE are spanned by H_{β3/2}(z/√2) and ₁F₁({−β3/4}, {1/2}; z²/2). The particular solutions σ̄(z) can be confirmed by direct substitution.

Lemma M.3. Let σ be of the form (102). If β3 = 0, then E((σ'(G))²) < ∞ implies that σ(z) is of the form
σ(z) = c + (β2/2)z, (106)
for some c, and that E((σ'(G))²) = β2²/4.

Proof of Lemma M.3. If β3 = 0 then, based on Lemma M.2, we can re-write σ as
σ(z) = √(π/2) c̃1 erfi(z/√2) + c̃2 + (β2/2)z − (β1/4)z² · ₂F₂({1,1}, {3/2, 2}; z²/2), (107)
for some c̃1, c̃2, where erfi is the imaginary error function. From this it follows that
σ'(z) = (1/4) e^{z²/2} ( 4c̃1 − √(2π) β1 erf(z/√2) ) + β2/2. (108)
The finiteness of the objective E((σ'(G))²) implies that lim_{z→±∞} (σ'(z))² e^{−z²/2} = 0. Since lim_{z→±∞} erf(z) = ±1, this implies that 4c̃1 = ±√(2π)β1 for both signs simultaneously, which implies that c̃1 = β1 = 0. Substituting c̃1 = β1 = 0 in (107) and simplifying we get σ(z) = c̃2 + (β2/2)z. From this expression one can then directly compute E((σ'(G))²). Lemma M.4.
Let σ be of the form (102). If β3 = 2, then E((σ'(G))²) < ∞ implies that σ(z) is of the form
σ(z) = cz − β1/2, (109)
for some c, and that E((σ'(G))²) = c².

Proof of Lemma M.4. If β3 = 2 then, based on Lemma M.2, we can re-write σ(z), for some c̃1 and c̃2, as the sum of the particular solution (104) and of the homogeneous terms c̃1 z and c̃2 · ₁F₁({−1/2}, {1/2}; z²/2). (110) Differentiating, σ'(z) is the sum of c̃1, of a constant, and of terms proportional to e^{z²/2} erfi- and ₂F₂-type functions multiplied by c̃2 and β2. (111) The finiteness of the objective E((σ'(G))²) implies that lim_{z→±∞} (σ'(z))² e^{−z²/2} = 0. Since these non-polynomial terms satisfy lim_{z→±∞} (σ'(z))² e^{−z²/2} = ∞ unless their coefficients cancel, and since the cancellation condition has opposite signs at +∞ and −∞, it must hold for both signs simultaneously. This implies that c̃2 = β2 = 0. Substituting c̃2 = β2 = 0 in (110), and simplifying, leads to σ(z) = c̃1 z − β1/2 (the constant factor √2 can be absorbed by c̃1). From this expression one can compute E((σ'(G))²).

Lemma M.5. Let σ be of the form (102). If β3 ∉ {0, 2} and c1 = c2 = 0, then
σ(z) = −β1/β3 − (β2/(β3 − 2))z, (112)
and
E((σ'(G))²) = β2²/(β3 − 2)². (113)
Proof of Lemma M.5. This follows directly from Lemma M.2.

Lemma M.6. Let σ be of the form (102). If β3 ∉ {0, 2} and either c1 or c2 is non-zero, then E((σ'(G))²) < ∞ implies that β3 = 4k for some k ∈ ℤ+.

Proof of Lemma M.6. If β3 ∉ {0, 2}, then by Lemma M.2 we have a formula for σ(z) from which we get that
σ'(z) = c1 (β3/√2) H_{β3/2 − 1}(z/√2) − c2 (β3/2) z · ₁F₁({1 − β3/4}, {3/2}; z²/2) − β2/(β3 − 2), (114)
where H and ₁F₁ are as defined in Lemma M.2. The boundedness of E((σ'(G))²) depends on how fast H_{β3/2 − 1}(z/√2) and ₁F₁({1 − β3/4}, {3/2}; z²/2) grow as z → ±∞. If c1 = 0 and c2 ≠ 0, E((σ'(G))²) < ∞ only if E((G · ₁F₁({1 − β3/4}, {3/2}; G²/2))²) < ∞, which in turn is true only if lim_{z→∞} (z · ₁F₁({1 − β3/4}, {3/2}; z²/2))² e^{−z²/2} = 0. This implies that β3 is a multiple of 4. If c1 ≠ 0 and c2 = 0, E((σ'(G))²) < ∞ only if E((H_{β3/2 − 1}(G/√2))²) < ∞, which in turn is true only if lim_{z→−∞} (H_{β3/2 − 1}(z/√2))² e^{−z²/2} = 0. This implies that β3 is even.
If c1, c2 ≠ 0, E((σ'(G))²) < ∞ only if lim_{z→∞} (σ'(z))² e^{−z²/2} = 0. But ₁F₁({1 − β3/4}, {3/2}; z²/2) grows much faster than H_{β3/2 − 1}(z/√2) as z → ∞, hence it must be that lim_{z→∞} (z · ₁F₁({1 − β3/4}, {3/2}; z²/2))² e^{−z²/2} = 0. This implies that β3 is a multiple of 4.

Lemma M.7. Let β3 = 4k for k ∈ ℤ+, and let σ(z) be a solution of (82). Then
σ(z) = −β1/β3 − (β2/(β3 − 2))z + c H_{β3/2}(z/√2) (116)
for some c, and
E((σ'(G))²) = ( 4^k ((2k)!)² / (2k − 1)! ) c² + ( β2/(β3 − 2) )². (117)

Proof of Lemma M.7. If β3 = 4k then ₁F₁({−β3/4}, {1/2}; z²/2) = ((−1)^k k!/(2k)!) H_{β3/2}(z/√2), and hence the homogeneous part of σ(z) can be written as c H_{β3/2}(z/√2), from which (116) follows. From this expression it follows that
σ'(z) = c (β3/√2) H_{β3/2 − 1}(z/√2) − β2/(β3 − 2), (118)
and from this expression the value of E((σ'(G))²) follows.

Lemma M.8. Let σ be of the form (102). If β3 ∉ {0, 2}, then E((σ'(G))²) < ∞ implies that σ(z) is of the form
σ(z) = −β1/β3 − (β2/(β3 − 2))z + c H_{β3/2}(z/√2) (120)
for some c, which must be zero unless β3 = 4k for some k ∈ ℤ+, and
E((σ'(G))²) = ( 4^k ((2k)!)² / (2k − 1)! ) c² + ( β2/(β3 − 2) )². (121)
Proof of Lemma M.8. This follows directly from Lemmas M.5-M.7.

Lemma M.9. If β1, β2, β3 satisfy (83)-(85), then
( β2/(β3 − 2) )² = μ1² and E((σ'(G))²) = μ1² + (β3/2)μ★². (122)

Proof of Lemma M.9. Defining t = E((σ'(G))²) and solving the linear system (83)-(85) leads to
β1 = 2μ0(t − μ1²)/(μ0² + μ1² − μ2), β2 = 2μ1(μ0² − μ2 + t)/(μ0² + μ1² − μ2), β3 = −2(t − μ1²)/(μ0² + μ1² − μ2). (123)
Computing β2/(β3 − 2) we get −μ1, from which the first relation follows. If we solve β3 = −2(t − μ1²)/(μ0² + μ1² − μ2) for t and recall that μ★² = μ2 − μ0² − μ1², the second relation follows.

We are now ready to prove Theorem 7, which we restate here for convenience.

Theorem (7). The minimizers of (19) are
σ(z) = az² + bz + c, (124)
where
a = ±μ★/√2, b = μ1, and c = μ0 − a. (125)

Proof of Theorem 7. We will show that any minimizer of (19) must be a function σ(z) = az² + bz + c for some a, b, c. From this fact, the variational problem can be reduced to a simple quadratic programming problem over a, b, c, from which it is straightforward to derive (124) and (125).
If μ★ = 0 then, by Lemma 3.1, we know that the solution must be linear, and hence we are done. If μ★ > 0 then, by Lemma 3.1, we know that σ(z) cannot be a linear function. Hence, from Lemma M.3 and Lemma M.4 we know that β3 cannot be 0 or 2. Therefore, the solution must be of the form specified by Lemma M.8, with c ≠ 0 and β3 = 4k for some k ∈ ℤ+. Define t = E((σ'(G))²). From the constraints (19) we know that β1, β2 and t can be written as linear functions of β3 = 4k. In particular, t = μ1² + (β3/2)μ★². From (121) and Lemma M.9 we can write t = (4^k((2k)!)²/(2k − 1)!)c² + μ1², which implies that c = ±((t − μ1²)(2k − 1)!/(4^k((2k)!)²))^{1/2} is also a function of β3. Therefore, the solution is parametrized by β3 alone, which must be chosen to minimize t = μ1² + (β3/2)μ★². Therefore, we must choose β3 = 4, the smallest possible multiple of 4. This implies that H_{β3/2}(z/√2) is a quadratic function, from which it follows that σ(z) is also a quadratic function, and hence we are done.
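Theorem 7's minimizer can be checked directly: with a = ±μ★/√2, b = μ1, c = μ0 − a, the quadratic AF reproduces the prescribed Gaussian moments (E(σ(G)) = a + c, E(Gσ(G)) = b, and E(σ(G)²) − μ0² − μ1² = 2a²). A numerical sketch with arbitrary target values (our illustration):

```python
import numpy as np

mu0, mu1, mustar = 0.4, 0.5, 0.3
a = mustar / np.sqrt(2)      # the '+' branch of a = ±mustar/sqrt(2)
b = mu1
c = mu0 - a

z = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def E(f):
    # Gaussian expectation by trapezoid quadrature
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))

sig = a * z**2 + b * z + c
m0 = E(sig * p)
m1 = E(z * sig * p)
m2 = E(sig**2 * p)
print(m0, m1, m2 - m0**2 - m1**2)   # recovers mu0, mu1, mustar^2
```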

M.4 PROOF OF THEOREM 8

Before we prove Theorem 8, we state and prove a series of intermediary results.

Lemma M.10. A necessary condition for optimality of σ(z) is that
−z · sgn(σ'(z)) + β1 + β2z + β3σ(z) = 0, for all z : σ'(z) ≠ 0, (126)
where β1, β2, and β3 must be such that
E(σ(G)) = μ0, E(σ(G)G) = μ1, E((σ(G))²) = μ2. (127)

Proof of Lemma M.10. A Lagrangian for this problem is E(|σ'(G)| + β1(μ0 − σ(G)) + β2(μ1 − Gσ(G)) + β3((1/2)(μ2 − (σ(G))²))), which can be written as ∫_{−∞}^{∞} L(z, σ, σ') dz where, if we define p(z) = (1/√(2π)) e^{−z²/2}, the Lagrangian density is
L(z, σ, σ') = p(z)( |σ'| + β1(μ0 − σ) + β2(μ1 − zσ) + β3(1/2)(μ2 − σ²) ). (128)
Any variation σ(z) + εh(z) of an optimal σ(z) must yield E(|σ'(G) + εh'(G)|) ≥ E(|σ'(G)|) in the objective. In particular, this must be the case for any variation such that h(z) = 0 whenever σ'(z) = 0. If we focus on these variations, the Euler-Lagrange equation can be derived despite |·| not being differentiable at 0. To be specific, it must hold that
(d/dz)(∂L/∂σ') − ∂L/∂σ = 0 ∀z : σ'(z) ≠ 0. (132)
Since there are no fixed boundary conditions on our integration domain (−∞, ∞), it also needs to hold that lim_{z→±∞} ∂L/∂σ' = lim_{z→±∞} p(z)sgn(σ'(z)) = 0, which we choose not to list in our necessary conditions. Since sgn(σ'(z)) is locally constant wherever σ'(z) ≠ 0, we have (d/dz)(∂L/∂σ') = p'(z)sgn(σ'(z)) = −zp(z)sgn(σ'(z)), and equation (132) implies (126). First order optimality conditions imply that the Lagrange multipliers must be chosen such that E(σ(G)) = μ0, E(σ(G)G) = μ1, E((σ(G))²) = μ2.

Lemma M.11. If β3 ≠ 0, the solutions of (126) are of the form
σ(z) = −β1/β3 − ((1 + β2)/β3) · min{max{z, a}, b}, (133)
for some constants a, b, where a < b if (1 + β2)/β3 ≤ 0 and a > b otherwise.

Proof of Lemma M.11. Since σ is a one-dimensional function, the requirement of existence of a weak derivative implies that there exists an absolutely continuous function that agrees with σ almost everywhere. We will work with these absolutely continuous representations of σ. Other solutions can differ from the absolutely continuous solutions only on a set of measure zero with respect to the Gaussian measure. From (126) we know that, wherever σ'(z) ≠ 0, we have σ(z) = mz + q for the same fixed m and q.
Hence, since σ is continuous, σ must be an alternation of flat portions and portions with the same slope a. Because of continuity, we cannot have a flat portion interrupt a portion of slope a (unless a = 0), as illustrated in Figure 2-right, and σ must be as in the other two cases in Figure 2. These have a form as in (133).

Lemma M.12. Let σ be of the form (133). Then,

E(|σ′(x)|) = |(1 + λ₂)/λ₃| P(x ∈ [x_L, x_R]),
μ₁ = E(σ(x)x) = −((1 + λ₂)/λ₃) P(x ∈ [x_L, x_R]),

where [x_L, x_R] should be interpreted as [x_R, x_L] if x_R < x_L. In particular, E(|σ′(x)|) = |μ₁|.

Proof of Lemma M.12. From Lemma M.11, we have explicit formulas for σ and σ′. The proof boils down to a direct calculation of the expected values, which themselves boil down to computing a few Gaussian integrals.

We are now ready to prove Theorem 8, which we restate below for convenience.

Theorem (8). One minimizer of (3) is σ(x) = μ₀ + a min{max{x, −c}, c} where a = μ₁/erf(c/√2), erf is the Gauss error function, and c ∈ ℝ is the unique solution to the equation

μ₁²/μ_★² = erf(c/√2)² / ((c² − 1) erfc(c/√2) + 1 − c√(2/π) e^{−c²/2} − erf(c/√2)²),

if μ_★ ≠ 0, and c = ∞ if μ_★ = 0.

Proof of Theorem 8. From Lemma M.11, we know that if λ₃ ≠ 0, then any minimizer must have the form (133). From Lemma M.12 we know that all of the functions of this form have the same objective value. Hence, if λ₃ ≠ 0, all of the functions of the form (133) that satisfy the constraints (19) are global minimizers. We normalize λ₃ = 1. To satisfy the three constraints (19) we have 4 remaining values to play with, namely λ₁, λ₂, x_L, x_R. Hence, we set x_L = −x_R. With a reparameterization, this leads to σ having the form σ(x) = b + a min{max{x, −c}, c}, where a now has a new meaning. That is, σ is flat outside of the interval [−c, c], c ≥ 0, and inside of this interval it has slope a. The goal is to find a, b, and c from the constraints (19). From σ(x) = b + a min{max{x, −c}, c} a direct computation leads to

μ₀ = b, (138)
μ₁ = a erf(c/√2), (139)
μ₂ = a²((c² − 1) erfc(c/√2) + 1 − c√(2/π) e^{−c²/2}) + b², (140)

where erfc = 1 − erf. We can use the first two equations to write b and a as functions of μ₀, μ₁, and c.
Substituting b and a with these functions in the third equation, and simplifying, leads to

μ₂ = μ₀² + μ₁² ((c² − 1) erfc(c/√2) + 1 − c√(2/π) e^{−c²/2}) / erf(c/√2)². (141)

Recalling that μ₂ = μ_★² + μ₁² + μ₀², replacing this definition into the above equation, and simplifying leads to

μ₁²/μ_★² = erf(c/√2)² / ((c² − 1) erfc(c/√2) + 1 − c√(2/π) e^{−c²/2} − erf(c/√2)²). (142)

One can show that the function on the right-hand side of (142) is monotonically increasing in c ∈ [−∞, ∞] with range [0, ∞], which implies that there is only one c that solves (142).
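The moment computations (138)–(140) and the monotonicity claim behind (142) can be checked numerically. The sketch below is not part of the paper; the values of μ₀, μ₁, c are arbitrary test values. It integrates against the Gaussian density, splitting the integral at ±c where the activation has kinks, and also checks the Stein's-lemma identity E(σ(x)x) = E(σ′(x)) that underlies Lemma M.12's conclusion E(|σ′(x)|) = |μ₁|.

```python
# Numerical sanity check (a sketch, not the paper's code) of the moments of the
# saturated linear activation sigma(x) = mu0 + a*clip(x, -c, c), x ~ N(0, 1).
# mu0, mu1, c below are arbitrary test values.
import numpy as np
from scipy import integrate, special, stats

mu0, mu1, c = 0.3, 0.8, 1.1
a = mu1 / special.erf(c / np.sqrt(2.0))  # slope implied by E(sigma(x) x) = mu1, eq. (139)

def gauss_expect(f):
    # E[f(x)] for x ~ N(0,1); integrate piecewise so each piece is smooth at the kinks +-c.
    pieces = [(-np.inf, -c), (-c, c), (c, np.inf)]
    return sum(integrate.quad(lambda x: f(x) * stats.norm.pdf(x), lo, hi)[0]
               for lo, hi in pieces)

sigma = lambda x: mu0 + a * np.clip(x, -c, c)

m0 = gauss_expect(sigma)                    # should equal mu0, eq. (138)
m1 = gauss_expect(lambda x: sigma(x) * x)   # should equal mu1, eq. (139)
m2 = gauss_expect(lambda x: sigma(x) ** 2)  # compare with eq. (140)
m2_closed = mu0**2 + a**2 * ((c**2 - 1) * special.erfc(c / np.sqrt(2))
                             + 1 - c * np.sqrt(2 / np.pi) * np.exp(-c**2 / 2))

# sigma'(x) = a on (-c, c) and 0 outside; Stein's lemma gives E(sigma(x) x) = E(sigma'(x)),
# so E(|sigma'(x)|) = |mu1| here (a > 0), as stated in Lemma M.12.
m1_stein = gauss_expect(lambda x: a * (abs(x) < c))

def rhs_142(c):
    # right-hand side of (142), as reconstructed above
    e = special.erf(c / np.sqrt(2))
    den = ((c**2 - 1) * special.erfc(c / np.sqrt(2)) + 1
           - c * np.sqrt(2 / np.pi) * np.exp(-c**2 / 2) - e**2)
    return e**2 / den
```

For these test values, m0, m1, m2, and m1_stein agree with the closed forms to quadrature precision, and rhs_142 is increasing on a few sample points, consistent with the uniqueness claim.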

M.5 PROOF OF THEOREM 9

This proof involves heavy algebraic computations. To aid the reader, this paper is accompanied by a Mathematica file that symbolically checks the equations both in the theorem statement and in the proof below. This file is in the supplementary zip file, as well as in the following Github link https://github.com/Jeffwang87/RFR_AF. It is called RunMeToCheckProofOfTheorem9.nb.

Proof of Theorem 9. Theorem 9 amounts to a statement about the solutions of the optimization problem (22) for regime 1. Its proof amounts to studying the local minima of the objective via its first and second derivatives, both in the interior and on the boundary of the variables' domain. We will prove the theorem for ψ₁ < ψ₂ and ψ₁ > ψ₂ separately. For ψ₁ = ψ₂ the objective is not defined. In what follows, we let f = (1 − α)E₁^∞ + αS₁^∞. We will use the fact that f is a one-dimensional function of ζ² ∈ [0, +∞], as can be seen from (13) and (16). We will use this, and the fact that (28) defines a monotonic relation between γ and ζ², to express f as a function of γ and to study the solutions of the optimization problem (22) in the variable γ. Notice that (28) can be solved for ζ² = μ₁²/μ_★² explicitly in terms of γ and min{ψ₁, ψ₂} and, after renaming variables, this can be rearranged to get (23), which connects optimal values of γ with optimal values of μ₀, μ₁, and μ₂. Henceforth, we denote f(γ) simply by f. We note that since ζ² ∈ [0, +∞], from (28) it follows that γ ∈ [γ_L, γ_R] := [−min{ψ₁, ψ₂}, min{0, 1 − min{ψ₁, ψ₂}}]. Hence, we only need to study the function f(γ) on this interval.

Case 1, ψ₁ < ψ₂: For the sake of simplicity, and for the most part, we will assume that ψ₁ ≠ 1 and omit the argument for ψ₁ = 1. The argument when ψ₁ = 1 is almost identical to the argument when ψ₁ ≠ 1 if we work with the extended reals [−∞, ∞]. Furthermore, most conclusions for ψ₁ = 1 can be obtained by taking the limit ψ₁ → 1 with ψ₁ ≠ 1.
The only situation where this is not the case is that the polynomial p(γ), referenced right after Table 1, has an expression when ψ₁ = 1 that cannot be obtained as the limit of its expression for ψ₁ ≠ 1. We will also assume that α > 0. Recall that we are already assuming that α < 1 because when α = 1 our problem is trivial. If α = 0 one can check that f(γ) is a decreasing line, hence the minimum is at γ = γ_R, and this solution can be obtained from the first row of Table 1. This solution for α = 0 can also be obtained by studying α > 0 and taking α → 0. Below we thus assume that 0 < α < 1.

We start by studying the second derivative of f. A direct computation yields

d²f/dγ² = g̃(γ) + C, (143)

where C is constant in γ and increasing in ζ², and g̃(γ) is a rational function of γ whose numerator is proportional to −2γ³ − 3γ²(ψ₁ − 1) + ψ₁(ψ₁ − 1)² and whose denominator is ((γ + ψ₁)² − ψ₁)³. A tedious calculation (omitted) shows that d²g̃/dγ² ≥ 0 for γ ∈ [γ_L, γ_R] (i.e. g̃ is convex there), that g̃(γ_L) < g̃(γ_R), and that dg̃/dγ(γ_L) < 0. From this it follows that, depending on the value of ζ², the concavity d²f/dγ² is positive or negative, as illustrated in Figure 3. To be specific, starting from large ζ², i.e. small −C, and decreasing its value, i.e. increasing −C, we obtain the following four scenarios. While ζ² is large and −C is below A₂, the function f is convex. After ζ² reaches a value ζ₁² at which −C touches A₂, the function is convex for small γ, then concave, and then convex for large γ. As ζ² keeps decreasing, and after it reaches a value ζ₂² at which −C touches A₁, the function is concave for small γ, and then convex for large γ. Finally, after ζ² reaches a value ζ₃² at which −C touches A₃, the function is concave. Note that by definition 0 ≤ ζ₃² < ζ₂² < ζ₁².

It is possible to compute closed-form expressions for the γ-coordinates of the points A₁, A₂, and A₃, which we denote by γ₁, γ₂, and γ₃; each is an explicit function of ψ₁ alone, with γ₂ taking different forms for ψ₁ ≤ 1 and ψ₁ > 1. Using the closed-form expression for −C, see (143), we find closed-form expressions for ζ₁², ζ₂², and ζ₃² as ζ₁² = {ζ² : g̃(γ₂) + C(ζ²) = 0}, ζ₂² = {ζ² : g̃(γ₁) + C(ζ²) = 0}, and ζ₃² = {ζ² : g̃(γ₃) + C(ζ²) = 0}. By definition of ζ₁², ζ₂², and ζ₃², these equations have a unique solution when ψ₂ > ψ₁ > 0, and their explicit expressions are given in (45), (46), and (47) respectively.

Now that we have characterized the curvature of f(γ), we are ready to locate its global minimum. To do so, we will use the curvature of f(γ) together with the first-order optimality condition df/dγ = 0, and the following three extra pieces of information: the sign of the derivative of f(γ) at γ = γ_L; the sign of the derivative of f(γ) at γ = γ_R; and the sign of f(γ_L) − f(γ_R). A direct computation yields

df/dγ = (c₀ + c₁γ + c₂γ² + c₃γ³ + c₄γ⁴ + c₅γ⁵) / ((ψ₁ − (γ + ψ₁)²)² (ψ₁ − ψ₂)), if ψ₁ ≠ 1, and (144)

df/dγ = (c̃₀ + c̃₁γ + c̃₂γ² + c̃₃γ³) / ((2 + γ)² (−1 + ψ₂)), if ψ₁ = 1, (145)

where the coefficients c₀, …, c₅ and c̃₀, …, c̃₃ are, apart from a multiplying constant, given in (51)–(60). The roots of the first denominator, i.e. γ = −ψ₁ − √ψ₁ and γ = −ψ₁ + √ψ₁, are not roots of the first numerator when ψ₁ ≠ 1, and the root of the second denominator, i.e. γ = −2, is not a root of the second numerator when ψ₂ > ψ₁ = 1.
Therefore, in the interior of the domain the first-order optimality condition reduces to p(γ) = 0, where p denotes the numerator polynomial above. Not all solutions of p(γ) = 0 minimize f(γ). To locate the minimizer, we use the sign of the derivative of f(γ) at γ = γ_L; the sign of the derivative of f(γ) at γ = γ_R; and the sign of f(γ_L) − f(γ_R). These signs can be determined using Lemma M.13. Lemma M.13 also proves Remark 10.

Case 1.4, ψ₁ < ψ₂ ∧ ζ₂² < ζ² < ζ₁² ∧ f′(γ_L) < 0 ∧ f′(γ_R) < 0: The function f is first convex, then concave, and then convex. It is decreasing both at γ = γ_L and at γ = γ_R. Hence f has at most one local minimum in the interior of the domain, at γ = r₁ if it exists, and a local minimum at γ_R, the domain being [γ_L, γ_R]. Therefore, the minimum can be expressed as argmin over {r₁, γ_R}.

Case 1.5, ψ₁ < ψ₂ ∧ ζ₂² < ζ² < ζ₁² ∧ f′(γ_L) < 0 < f′(γ_R): The function is first convex, then concave, and then convex. It is decreasing at γ = γ_L and increasing at γ = γ_R. Hence f has no local minimum at the end points of the domain [γ_L, γ_R]. Also, either f has exactly one local minimum in the interior of the domain, at γ = r₁, or f has exactly two local minima and one local maximum in the interior of the domain, at γ = r₁, γ = r₃, and γ = r₂ respectively. Therefore, the minimum can be expressed as argmin over {r₁, r₃}. Note that r₁ always exists but r₃ might not.

Case 1.6, ψ₁ < ψ₂ ∧ ζ₂² < ζ² < ζ₁² ∧ f′(γ_R) < 0 < f′(γ_L): The function is first convex, then concave, and then convex. It is increasing at γ = γ_L and decreasing at γ = γ_R. Hence, the minimum is either at γ_L or at γ_R, depending on whether f(γ_L) > f(γ_R) or f(γ_L) < f(γ_R). This case is an example where it is clear that allowing f(γ_L) = f(γ_R) leads to non-uniqueness in the optimal γ. See Remark 9.

Case 1.7, ψ₁ < ψ₂ ∧ ζ₂² < ζ² < ζ₁² ∧ f′(γ_L) > 0 ∧ f′(γ_R) > 0: The function is first convex, then concave, and then convex. It is increasing both at γ = γ_L and at γ = γ_R. Hence either f has no critical point in the interior of the domain, or it has two critical points in the interior of the domain, at γ = r₁ (local maximum) and γ = r₂ (local minimum) if they exist, and f always has a local minimum at γ_L. Therefore, the minimum can be expressed as argmin over {r₂, γ_L}. Note that γ_L always exists, but r₂ does not always exist.
Case 1.8, ψ₁ < ψ₂ ∧ ζ₃² < ζ² ≤ ζ₂² ∧ f′(γ_L) < 0 ∧ f′(γ_R) < 0: The function is first concave and then convex, and is decreasing both at γ = γ_L and at γ = γ_R. Hence its minimum is at γ = γ_R.

Case 1.9, ψ₁ < ψ₂ ∧ ζ₃² < ζ² ≤ ζ₂² ∧ f′(γ_L) < 0 < f′(γ_R): The function is first concave and then convex. It is decreasing at γ = γ_L and increasing at γ = γ_R. Hence f has exactly one local minimum in the interior of the domain, which is at γ = r₁ and always exists, and which is also a global minimum.

Case 1.10, ψ₁ < ψ₂ ∧ ζ₃² < ζ² ≤ ζ₂² ∧ f′(γ_R) < 0 < f′(γ_L): The function is first concave and then convex. It is increasing at γ = γ_L and decreasing at γ = γ_R. Hence, the minimum is either at γ_L or at γ_R, depending on whether f(γ_L) > f(γ_R) or f(γ_L) < f(γ_R).

Case 1.11, ψ₁ < ψ₂ ∧ ζ₃² < ζ² ≤ ζ₂² ∧ f′(γ_L) > 0 ∧ f′(γ_R) > 0: The function is first concave and then convex. It is increasing both at γ = γ_L and at γ = γ_R. Hence either f has no critical point in the interior of the domain, or it has two critical points in the interior of the domain, at γ = r₁ (local maximum) and γ = r₂ (local minimum) if they exist, and f always has a local minimum at γ_L. Therefore, the minimum can be expressed as argmin over {r₂, γ_L}. Note that γ_L always exists, but r₂ does not always exist.

Case 1.12, ψ₁ < ψ₂ ∧ ζ² ≤ ζ₃² ∧ f′(γ_L) < 0 ∧ f′(γ_R) < 0: The function is concave. It is decreasing both at γ = γ_L and at γ = γ_R. Hence, its minimum is at γ = γ_R.

When ψ₁ < ψ₂ ∧ ζ² ≤ ζ₃², Lemma M.13 and a tedious calculation (omitted) show that f′(γ_R) < 0, which is why the 2nd and 3rd rows of the last column of Table 1 are empty. A simpler way to see that the 2nd and 3rd rows of the last column of Table 1 must be empty is as follows. Since ψ₁ < ψ₂ ∧ ζ² ≤ ζ₃², we know (from the argument following Figure 3) that f is concave. If f′(γ_L) < 0 and f′(γ_R) > 0 (i.e. the condition of these rows holds) then by Lemma M.13 we know that f is decreasing at γ = γ_L and increasing at γ = γ_R, but this is impossible for a concave function.

Case 1.13, ψ₁ < ψ₂ ∧ ζ² ≤ ζ₃² ∧ f′(γ_R) < 0 < f′(γ_L): The function is concave. It is increasing at γ = γ_L and decreasing at γ = γ_R. Hence, the minimum is either at γ_L or at γ_R, depending on whether f(γ_L) > f(γ_R) or f(γ_L) < f(γ_R).

Case 1.14, ψ₁ < ψ₂ ∧ ζ² ≤ ζ₃² ∧ f′(γ_L) > 0 ∧ f′(γ_R) > 0: The function is concave. It is increasing at γ = γ_L and at γ = γ_R. Hence, the minimum is at γ_L.
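Computationally, every case above follows the same recipe: collect the interior critical points together with the endpoints, and take the argmin of f over that candidate set. A minimal sketch of this recipe (not the paper's code; the test function below is a hypothetical convex-concave-convex example of the kind appearing in Cases 1.4-1.7):

```python
# Locate the global minimum of a smooth 1-D function on [gl, gr] by comparing f at
# interior critical points (sign changes of f') and at the endpoints.
import numpy as np
from scipy.optimize import brentq

def global_min_1d(f, fprime, gl, gr, n_grid=2001):
    grid = np.linspace(gl, gr, n_grid)
    vals = fprime(grid)
    # bracket each sign change of f' and refine it with a root finder
    crit = [brentq(fprime, grid[i], grid[i + 1])
            for i in range(n_grid - 1) if vals[i] * vals[i + 1] < 0]
    candidates = [gl, gr] + crit  # endpoints always enter the comparison
    return min(candidates, key=f)

# Hypothetical convex-concave-convex example: f'' = 12 g^2 - 2 changes sign twice.
f = lambda g: g**4 - g**2
fp = lambda g: 4 * g**3 - 2 * g
gmin = global_min_1d(f, fp, -2.0, 2.0)  # interior minima at +-1/sqrt(2), value -1/4
```

The case analysis in the proof refines this brute-force comparison: the curvature pattern and the endpoint derivative signs determine in advance which of the candidates can actually be the global minimizer.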
Case 2, ψ₁ > ψ₂: The case when ψ₂ = 1 can be proved by taking appropriate limits of the case when ψ₂ ≠ 1. For now, we assume that ψ₂ ≠ 1.

To generate the plot run the file named RunMeToGenerateFigure_4.m. It runs using Matlab 2020b. We ran it using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of 2667 MHz DDR4 memory. On this machine it takes about 10 hours to run.

In Figure 5 we plot the test error E as a function of λ when the SNR is 10, when we have ψ₂ = 2, i.e. two training samples per input dimension, and when the number of features is very large, namely ψ₁ = 10000. We do so in two different settings: (1) the AF is a fixed ReLU; (2) the AF is a numerically optimized quadratic function. This AF is optimized as follows. For each value of λ, we run a Bayesian optimization subroutine that minimizes the test error across all possible quadratic AFs. We observe that choosing an optimized AF and being "careless" about the choice of regularization leads to results as good as using a ReLU and optimizing λ, which is the common practice. The code to produce Figure 5 is in the following Github link: https://github.com/Jeffwang87/RFR_AF. This code is also available in the supplementary zip file provided. To generate the plot run the file named RunMeToGenerateFigure_5.m. It runs using Matlab
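The experiment just described can be imitated in a few lines. The sketch below is an illustration under simplified assumptions, not the Matlab code in the repository: the linear-plus-noise data model is a stand-in, the problem sizes are small, and a coarse grid search replaces the Bayesian-optimization subroutine. It estimates the RFR test error for a fixed ReLU AF versus a tuned quadratic AF at a single, "carelessly" chosen λ:

```python
import numpy as np

rng = np.random.default_rng(0)

def rfr_test_error(act, d=20, n=100, N=250, lam=0.1, n_test=2000):
    """Monte Carlo test error of ridge regression on random features act(X W^T / sqrt(d)).
    The linear target y = <beta, x> + noise is a stand-in for the paper's data model."""
    beta = rng.standard_normal(d) / np.sqrt(d)
    W = rng.standard_normal((N, d))
    feats = lambda X: act(X @ W.T / np.sqrt(d))  # (m, N) feature matrix
    Xtr = rng.standard_normal((n, d))
    ytr = Xtr @ beta + 0.1 * rng.standard_normal(n)
    Z = feats(Xtr)
    coef = np.linalg.solve(Z.T @ Z / n + lam * np.eye(N), Z.T @ ytr / n)
    Xte = rng.standard_normal((n_test, d))
    return float(np.mean((feats(Xte) @ coef - Xte @ beta) ** 2))

relu_err = rfr_test_error(lambda z: np.maximum(z, 0.0))
# Coarse grid search over quadratic AFs sigma(z) = b1*z + b2*z**2, a cheap stand-in
# for the Bayesian-optimization subroutine described above.
quad_err = min(rfr_test_error(lambda z, b1=b1, b2=b2: b1 * z + b2 * z * z)
               for b1 in (0.25, 0.5, 1.0) for b2 in (0.0, 0.1, 0.25))
```

With a finer search over (b1, b2) and over λ, this loop reproduces the qualitative comparison in Figure 5, though the asymptotic formulas in the paper are what the Matlab scripts actually evaluate.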



) < 0, which implies that to achieve the minimum one must have μ₁ → ∞. In this case, if we express it as a function of μ₁ and μ_★ and take μ₁ → ∞ and μ_★ → 0, we get → .



where x₁ is the 1st component of x. We define the Signal to Noise Ratio (SNR) by =

did this in the context of shallow but infinitely wide NNs, and the works Garriga-Alonso et al. (2019); Novak et al. (2019); de G. Matthews et al. (2018); Hazan & Jaakkola (2015) did this for deep networks. Daniely et al. (2016); Daniely (2017) connected the RFR model to training a NN with gradient descent.

Figure 2: The first two functions are the only two possible types of continuous functions that satisfy (126). The right-most function also satisfies (126) but is not continuous.


Figure 3: Depending on the value of 2 , the function is convex, convex-concave-convex, concave-convex, or concave, respectively. The points A, B, and C will be referenced later in the proof. It is possible to compute closed-form expressions for the coordinate of these points.

Figure 4: Learning a function from the MNIST data set also produces a double descent curve, i.e. the test error decreases as the model's complexity increases, then it increases until the interpolation threshold, which is around ψ₁/ψ₂ = 2, and then it decreases again past the interpolation threshold. By optimizing the AF, this phenomenon disappears. The meaning of this figure is related to the meaning of Figure 1-(A) in the main text.

Figure 5: Learning a function from the MNIST data set using the RFR model can be improved by selecting the appropriate ridge regularization parameter λ, around 10^-1 in the plot. If we are not careful about this choice, but instead use an optimized AF, we can achieve similarly good performance. The meaning of this figure is related to the meaning of Figure 1-(C) in the main text.

al. (2021); Hassani & Javanmard (2022) (cf. Appendix C). The results of Hassani & Javanmard (2022) would need to be generalized from a ReLU to general AFs before one could optimize the AFs' parameters.

Published as a conference paper at ICLR 2023

ACKNOWLEDGMENTS

We thank Song Mei for valuable discussions regarding the asymptotic properties of the Random Feature Regression model. We thank Piotr Suwara for his help regarding Hermite-type differential equations and their solutions.


Remark 11. Let 1 > 2 . We always have that ∈ [0, 1]. It also holds that

We have that ∈ [0, 1] if and only if 1 ≤ 2 + (1/2)( 2 - 1) 2 min{ 2 , 1} and

Note that if 1 > 2 = 1 then is not defined; see the statement of Theorem 9. If we had attempted to use the formulas above, we would have obtained that ∈ [0, 1] if and only if 1 ≤ 1 ≤ 1, this last condition never being met when 1 > 2 = 1. Also, ≤ always, where and are given in (67) and (68). Furthermore, < 2 if and only if 1/ > min{1, max{0, 2 2 - 1}}. Since ≤ , it follows that 1 and 2 cannot simultaneously hold, and hence no two rows can simultaneously hold. We always have that 3 < 2 < 1 , and therefore no two columns can simultaneously hold. Remark 11 follows from Lemma M.14, used in the proof of Theorem 9.

J.2 STATEMENT OF THEOREM 9 WHEN 1 < 2

Theorem (9 continued). If 1 < 2 then 1 = ( < ), 2 = ( > ), and = 2 /(3 2 + 1 - 2 min{1 + 1 , 2 1 } + -1 ). Furthermore, the polynomial ( ) is defined as follows. If 1 ≠ 1 then ( ) = 0 + 1 + 2 2 + 3 3 + 4 4 + 5 5 , where

1 = 2( 1 - 1) 1 ( ( ( 1 (9 1 - 6 2 - 4) + 2 ) + 2 1 2 ) - 2 1 ), (52)
2 = 2 1 ( ( + 4 1 (4 1 - 3) + 2 (4 - 4 = ( (12 1 - 1) - 3 2 + 2 ) -, (55)

If 1 = 1 then ( ) = 0 + 1 + 2 2 + 3 3 , where

0 = (-10 + 10 2 - 4 2 ) + 4 , (57)
1 = (-20 + 12 2 - 4 2 ) + 4 , (59)
and 3 = -2 . (60)

J.3 STATEMENT OF THEOREM 9 WHEN 1 > 2

Theorem (9 continued). If 1 > 2 and 2 ≠ 1 then 1 = ( 1 < ) ∨ (( < ) ∧ ( < 1 < )), 2 = ( 1 > ) ∨ (( > ) ∧ ( < 1 < )), and

1 : ( 1 ) = 0, where is a 4th degree polynomial described in Appendix J.4, (61)
= (2 1 - 2 )/(2 1 - 2 + 1 + -1 ), (64)
= 2 + ( min{1, 2 }( 2 - 1) 2 /2), and

If 1 > 2 and 2 = 1 then 1 = False, 2 = True, and eqs. (61)-(68) hold when reading Table 1, except (66), which is not defined, and (65). Furthermore, the polynomial ( ) is defined as follows.

This appendix is referenced in the statement of Theorem 9 in equation (61). The coefficient 1 is such that ( 1 ) = 0, where

Lemma M.13.
If 1 < 2 , the following relationships hold:

Also,

where , and are given in (48), (50), and (49) respectively. Furthermore, the following are true and help determine from which row in Table 1 we are in. Finally,

Proof. The derivation of (146)-(150) follows from direct substitution of = or = into (for which we have expressions from Theorems 1 and 4) or into its derivative (in eqs. (144)-(145)). The derivation of (151)-(153) follows from the observation that equations (146)-(150) are linear in , and hence we can easily compute the values of at which these expressions change from a negative value to a positive value. Once an explicit formula for , , and is obtained, it is easy to find the criteria that decide their relative magnitude by comparing the terms under the parentheses in the denominators of (48), (50), and (49). It is also easy to see from their formulas that their value is always in the range [0, 1].

To finish the proof of Theorem 9 we consider the different scenarios in Table 1. In what follows, statements about the concavity of are justified via the explanation accompanying Figure 3, and statements about the slope of are based on Lemma M.13.

The function is convex and is decreasing at = and decreasing at = . Hence its minimum is at = .

The function is convex and it is decreasing at = and increasing at = . Hence it has a unique minimizer at = 1 .

When 1 < 2 ∧ 2 ≥ 1 , we can use Lemma M.13 and a tedious calculation (omitted) to show that < , which is why the 4th and 5th rows of the first column of Table 1 are empty. A simpler way to see that the 4th and 5th rows of the first column of Table 1 must be empty is as follows. Since 1 < 2 ∧ 2 ≥ 1 , we know (from the argument accompanying Figure 3) that is convex. If > and < (i.e. 1 holds), then by Lemma M.13 we know that is increasing at = and decreasing at = , but this is impossible for a convex function.

The function is convex and it is increasing at both = and = .
Hence its minimum is at = .

We first prove, via the second derivative, that ˜ is convex, just like in the case when 1 < 2 . A direct computation yields

A tedious calculation (omitted) shows that d 2 ˜ /d 2 ≥ 0 for ∈ [ , ] (i.e. ˜ is convex), that ˜ ( ) < ˜ ( ), and that d ˜ /d ( = ) < 0. To do this calculation, we recommend the following. First break ˜ into two components: one component proportional to 2 1 , called ˜ 1 , and one component proportional to

From this it follows that, depending on the value of 1 , the concavity d 2 /d 2 is positive or negative. The situation is exactly the same as in Figure 3 but now the axis is 1 , the points 1 , 2 and 3 are different, and so are the definitions of 1 , 2 and 3 . We do, however, have that, by definition,

With a slight abuse of notation we refer to the coordinate value of these points by 1 , 2 , and 3 . Notice that when 2 = 1 we have that 3 = +∞, and in fact we also have ( ) = +∞. Getting 2 is a bit more complicated. From the first-order condition d ˜ /d = 0 and the convexity of ˜ (recall that ˜ is a rational function) we can extract that the coordinate of 2 is the unique root of a 4th degree polynomial ℎ( ), with ℎ 4 = - . Call this root 2 . We then have 2 = ˜ ( 2 ). It turns out that we can write 2 directly as the solution of ( 2 /(( 2 + 2 ★ ) 2 )) = 0, where 2 ( 2 + + 1), and 4 = 16 6 2 .

Using the expressions for 1 and 3 , and the expression for (see (154)), 2 and 3 are defined as 2 = 1 : 1 ( 1 ) + ( 1 ) = 0, and 3 = 1 : 3 ( 1 ) + ( 1 ) = 0. By definition of 2 and 3 , these equations have a unique solution when 1 > 2 > 0, and their expressions are given in (62) and (63) respectively. We can also define 1 as the solution of 1 = 1 : 2 ( 1 ) + ( 1 ) = 0. By definition of 1 , the solution is unique in the range 1 > 2 > 0. We can use the fact that 2 = ˜ ( 2 ) to write that 1 = 2 + 2 2 1 ˜ ( 2 ).
We can also use the fact that ( 2 /(( 2 + 2 ★ ) 2 )) = 0, which implies that (-( 1 ) / (( 2 + 2 ★ ) 2 )) = 0 when 1 = 1 , to write 1 as the root of a 4th degree polynomial ( ) that is specified in Appendix J.4, which is the way in which we choose to state Theorem 9.

Now that we have characterized the curvature of ( ), we are ready to locate its global minimum. To do so, we will use the curvature of ( ) together with the first-order optimality condition d /d = 0, and the following three extra pieces of information: the sign of the derivative of ( ) at = ; the sign of the derivative of ( ) at = ; and the sign of ( ) - ( ).

A direct computation yields

where the coefficients 0 , . . . , 5 are, apart from a multiplying constant, given in (70)-(75). The roots of the denominator, i.e. -√( 2 - 2 ) and √( 2 - 2 ), are not roots of the numerator; therefore, the first-order optimality conditions are ( ) = 0. Not all solutions of ( ) = 0 minimize ( ). To locate the minimizer, we use the sign of the derivative of ( ) at = ; the sign of the derivative of ( ) at = ; and the sign of ( ) - ( ). These signs can be determined using Lemma M.14. Lemma M.14 also proves Remark 11.

Lemma M.14. If 1 > 2 then the following relationships hold. If 2 > 1 then (160), (161), (164). Also, the following is true. Conditions 1 and 2 are defined in Theorem 9. Constant is defined in equation (65) in Theorem 9. Furthermore, we always have that

Finally, ≥ always, where and are given in (67) and (68). Furthermore, < 2 if and only if 1/ > min{1, max{0, 2 2 - 1}}.

Proof. Except for (162) and (157), the derivation of (156)-(167) follows from direct substitution of = or = into (for which we have expressions from Theorems 1 and 4) or into its derivative in equation (155). Equation (162), for when 2 = 1, is obtained by taking the limit of (164) and (167) as 2 ↑ 1 and 2 ↓ 1 respectively.
Equation (157), for when 2 = 1, is obtained by taking the limit of (159) and (161) as 2 ↑ 1 and 2 ↓ 1 respectively.


The derivation of condition (168) follows from the observation that equation (156) is linear in and is always negative for = 0 and positive for = 1. Hence, we can easily compute a value ∈ [0, 1] at which the expression changes from a negative value to a positive value. The expression for that we obtain is (64).

The derivation of condition (169) can be obtained through the following observations. First notice that d /d | = is a linear increasing function of . Now focus on the following three implications.

A direct computation shows that the sufficient condition in the first implication holds if and only if 1 < , where is given in (68); the sufficient condition in the second implication holds if and only if 1 > , where is given in (67); and the sufficient condition in the third implication holds if and only if

is true, we can use the second or third implication to conclude that d /d | = > 0. A direct calculation shows that ≥ and that < 2 if and only if 1/ > min{1, max{0, 2 2 - 1}}. A direct calculation also shows that ∈ [0, 1] if and only if the conditions in (171) and (172) hold.

Condition (170) can be obtained through the following observations. First, notice that both (164) and (167) are linear decreasing functions of . Second, when = 1 we always have ( ) - ( ) < 0. Therefore, there exists

The expression for is given by (65). From this expression, a direct calculation shows that ≥ 0 if and only if

To finish the proof of Theorem 9 for 1 > 2 we consider the different scenarios in Table 1. These are studied via cases that are exactly the same as Case 1.1 to Case 1.14 for 1 < 2 , but with < replaced by 1 and > replaced by 2 , because when 1 > 2 it is 1 and 2 that determine the sign of d /d | = . For example, for the top left-most cell in Table 1, both when 1 < 2 and when 1 > 2 , we have that is convex and is decreasing at and , so its minimum is at = . As another example, the 4th and 5th rows of the first column are not possible for exactly the same reasons as when 1 < 2 .
Namely, when 1 < = 1 then is convex. If > and 1 holds, then by Lemma M.14 we know that is increasing at = and decreasing at = , which is impossible for a convex function. Similarly, the 2nd and 3rd rows of the last column are not possible because when = 1 ≤ 3 then is concave, and if < ∧ 2 then Lemma M.14 tells us that is decreasing at = and increasing at = , which is not possible. We omit repeating the arguments for Case 1.2 to Case 1.14.

Above, for both 1 < 2 and 1 > 2 ≠ 1, Table 1 was derived using the fact that 1 holding implies that the derivative of at = is negative, and 2 holding implies that the derivative of at = is positive. It was also derived using the fact that < implies that ( ) > ( ) and that > implies that ( ) < ( ). When 2 = 1, from Lemma M.14, we know that the derivative of at = is +∞, and that ( ) < ( ). Hence we can keep Table 1 for 1 > 2 = 1 unchanged if, in this case, we set 1 to be false, 2 to be true, and = -∞.

M.6 PROOF OF THEOREM 10

This proof involves heavy algebraic computations. To aid the reader, this paper is accompanied by a Mathematica file that symbolically checks the equations both in the theorem statement and in the proof below. This file is in the supplementary zip file, as well as in the following Github link: https://github.com/Jeffwang87/RFR_AF. It is called RunMeToCheckProofOfTheorem10.nb.

Proof of Theorem 10. The proof amounts to a long calculus exercise, which we shorten by some careful observations. We first notice that E ∞ 2 and S ∞ 2 can both be written as functions of 2 = 2 ( 2 , , 0 , 1 , 2 ), and we can use this to reduce the optimization problem (22) to an optimization problem over just one variable. The variable 2 is a function of 0 , 1 , 2 ≥ 0, which we want to optimize, and has range [-∞, 0], the value -∞ being achieved when 2 ★ = 2 - 2 1 - 2 2 = 0. To avoid having to deal with infinities, we make use of the Möbius transformation = (1 + 2 )/( 2 - 1), and instead work with .

If we substitute 2 = (1 + )/(-1 + ) into the left-hand side of (24), and substitute (32) into the resulting expression, we confirm that (24) is satisfied. Furthermore, if we use 2 = (1 + )/(-1 + ) and (32) to write as a function of 0 , 1 , 2 , we can use the fact that

Our problem is thus equivalent to solving min

Once we know , any 0 , 1 , 2 that satisfies (24) will be a minimizer.
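The role of this substitution can be illustrated numerically. The sketch below is our own (the names `moebius`, `etas`, and `zetas` are not from the paper's code); it checks that a map of the form ζ = (1 + η)/(η − 1) is its own inverse and sends the unbounded range (−∞, 0] into the bounded interval [−1, 1):

```python
def moebius(x):
    # zeta = (1 + x) / (x - 1); applying the map twice returns x (an involution)
    return (1.0 + x) / (x - 1.0)

# eta = 0 maps to -1; as eta -> -infinity the image approaches +1 from below,
# so the unbounded range (-inf, 0] is sent into the bounded interval [-1, 1)
etas = [0.0, -0.5, -1.0, -10.0, -1e6]
zetas = [moebius(e) for e in etas]
```

This is why the proof can replace the variable with range [−∞, 0] by a variable confined to a compact interval, where boundary derivatives can be inspected directly.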

At this point we compute

and observe that this is a rational function of . The numerator is, apart from a multiplying constant, ( ), and the denominator is zero if and only if

Both are outside of the range of unless 2 = = 1. When 2 = 1, the = 1 zeros of the denominator only cancel zeros of the numerator if = ★ = 0. In the remainder of the proof we will assume that 2 + 2 ★ > 0. The optimal AFs' parameters when = ★ = 0 can be obtained as a limit when , ★ → 0. Since the zeros of the numerator and of the denominator never cancel out (assuming 2 + 2 ★ > 0), all of the critical points are given by ( ) = 0.

Now we compute the value of the derivative at the extremes of the range of , namely = -1, 1, or -1 + 2 2 . The value of the derivative at = -1 is 2 1 (-1 + ) < 0, which implies that = -1 is not a minimizer. The value of the derivative at = 1 is positive, and converges to +∞ when 2 → 1, which implies that = 1 is not a minimizer. The value of the derivative at = -1 + 2 2 is positive (its denominator is ( 2 - 1) 2 > 0), and converges to +∞ when 2 → 1, which implies that = -1 + 2 2 is not a minimizer. Another way to see that neither = 1 nor = -1 + 2 2 can be a solution is to note that these choices do not satisfy the equation in (24) unless = 0, which never happens in this regime. This calculation implies that ∈ (-1, min{1, -1 + 2 2 }), which we will assume from now on.

Finally, we show that the objective is convex in the domain of , which implies that there is only one critical point, that is, ( ) = 0 has only one solution in the domain (-1, min{1, -1 + 2 2 }), and that this critical point is a global minimum. To show that the objective is convex, we compute its second derivative, which is

The minimum of the denominator is always non-negative, and is zero only if = 1 or = -1 + 2 2 , values which have already been excluded because they are not minimizers. Hence, for ∈ (-1, min{1, -1 + 2 2 }), the denominator is strictly positive.

Published as a conference paper at ICLR 2023

To show that the numerator is non-negative, we only need to show that 4 2 2 + 2 (3( - 2) - 5) - 3 + 3 + 2 ≥ 0 in the range of . The minimum of 4 2 2 + 2 (3( - 2) - 5) - 3 + 3 + 2 over this range is always non-negative.

M.7 PROOF OF THEOREM 11

This proof involves heavy algebraic computations. To aid the reader, this paper is accompanied by a Mathematica file that symbolically checks the equations both in the theorem statement and in the proof below. This file is in the supplementary zip file, as well as in the following Github link: https://github.com/Jeffwang87/RFR_AF. It is called RunMeToCheckProofOfTheorem11.nb.

Proof of Theorem 11. Similar to the proof of Theorem 10, the majority of the proof is a long calculus exercise. We first notice that E ∞ 3 and S ∞ 3 can both be written as functions of 1 and 2 . The variable 1 is a function of 0 , 1 , 2 ≥ 0, and has range [-∞, 0], the value -∞ being achieved when 2 ★ = 2 - 2 1 - 2 2 = 0. To avoid having to deal with infinities, we make use of the Möbius transformation = (1 + 1 )/( 1 - 1), and instead work with . If we use 1 = (1 + )/( - 1) and (35) to write as a function of 0 , 1 , 2 , we can use the fact that 0

Our problem is thus equivalent to solving min

With a direct calculation we can check the following. The derivative d /d | =-1 is always negative (recall we are assuming < 1), which implies that there is no local minimum at = -1. The derivatives d /d | =1 and d /d | =-1+2 1 are always positive (recall that we are assuming 1 > 0 in addition to < 1), which implies that there is no local minimum at either = 1 or = -1 + 2 1 . Therefore, we know that the minimizer has ∈ (-1, min{1, -1 + 2 1 }). The derivative is always negative (when < 1), which implies that there is no local minimum at 2 = 0. We thus know that the minimizer of is in the interior of the domain for and 2 , and hence it can be found via ∇ = 0, where the gradient is with respect to and 2 . The remainder of the proof considers a few different cases depending on the values of and 1 .

Case when = 0:
The only way that d /d( 2 ) = 0 is if 2 → ∞ (we already know that = 1), which corresponds to ★ → 0. As we noted in the beginning, is a function of 1 and 2 ; since 1 is a function of 2 and 2 1 , and 2 is a function of 2 1 and 2 ★ , we know that is a function of 2 1 and 2 ★ . We express in these variables and compute d /d( 2 1 ) when ★ → 0. We get that

By minimizing the denominator with respect to 1 ≥ 0, we conclude that its minimum is 2 2 1 > 0, which implies that d /d( 2

Case when 1 = 1 ∧ 0 < ≤ 1/4:

As we noted in the beginning, is a function of 1 and 2 ; since 1 is a function of 2 and 2 1 , and 2 is a function of 2 1 and 2 ★ , we know that is a function of 2 1 and 2 ★ . We express in these variables and compute d /d( 2 1 ) when ★ → 0. We get that

By maximizing the numerator with respect to 2 1 ≥ 0, we conclude that its value is always strictly smaller than 2 2 1 2 (-1 + 4 ) ≤ 0 for any finite 1 . Hence d /d( 2 1 ) < 0, which implies that the minimum is achieved only when 1 → ∞. In this case, if we express as a function of 1 and ★ and take 1 → ∞ and ★ → 0, we get →

This case is very similar to the previous case, the only difference being that, because 1/4 < < 1, when we solve

we now get two possible solutions, namely,

First notice that if = 0, both expressions give 1 = 0. Let us assume now that > 0. If we maximize the first expression with respect to 1/4 < < 1, we conclude that its value is always smaller than -(3 /16) < 0, which implies that it is not a valid solution in the range 2 1 ≥ 0. If we minimize the second expression with respect to 1/4 < < 1, we conclude that its value is always non-negative, which implies that it is a valid solution in the range 2 1 ≥ 0. We therefore conclude that the second expression is the only stationary point of in the range 2 1 ≥ 0, whether = 0 or not.
Given that this is the only stationary point of in the range 2 1 ≥ 0, and given that the derivative d /d( 2 1 ) ≤ 0 at 1 = 0 (which can be checked via substitution), we conclude that the critical point must be a global minimum. If we substitute the optimal values for 2 1 and 2 ★ = 0 in , we get = 4

Case when 1 = 1:

As before, we start from knowing that the minimizer cannot be at the boundary of the domain. Therefore, since = 1, the only way that d /d( 2 ) = 0 is if 2 = ∞, which corresponds to ★ = 0. If we substitute 1 = (1 + )/( - 1) into the left-hand side of the last equality in (25), namely 2 1 (-1 + 2 1 - )(-1 + ) + 2 1 (1 + ), and let ★ → 0, we get 0, which confirms the condition on when 1 = 1. We take → ∞ in to obtain

If we minimize d 2 /d 2 over ∈ [-1, min(1, -1 + 2 1 )] and 1 ≥ 0, we obtain 2 1 /8 > 0. This shows that our objective is strictly convex in the range ∈ [-1, min(1, -1 + 2 1 )], and hence there is only one solution to d /d = 0. The rational function d /d has a denominator which is zero only if = -1 - 2, both of which are outside the valid range for . Hence the unique solution to d /d = 0 comes from the zeros of the numerator of d /d . This numerator is (a constant times) a polynomial in whose coefficients are described in Theorem 11.

N EXPERIMENT ON A REAL DATASET

It is tempting to extrapolate our theory to practice. The scope of validity of our claims is rigorously stated in our theorems' assumptions, and one should be cautious not to claim their applicability beyond this scope. In particular, we are not attempting to improve on existing empirical techniques to design AFs, but rather seek a better understanding of an already popular model, the RFR model. Namely, we want to understand the effect that using optimal AFs has on the RFR model. For the design of good, or optimal, AFs in practical settings with empirical/heuristic methods, we refer the reader to Section C. Nonetheless, here we test some of our more general conclusions on real data. This appendix is referenced in the main text in Section 3.3.

In this section, the data, and the fact that we do not work with infinite dimensions, are the only deviations from our theoretical setup. In particular, we still work with an RFR model. We use the MNIST data set (Deng, 2012) to train an RFR model that approximates a function , our ground truth object, defined as follows. For a given digit image with class ∈ {0, 1, . . . , 9}, we define ( ) = -5 + /9. Note that in the RFR model we are working with regression, not classification. The MNIST data set has input dimension d = 28 × 28 = 784. For the test set we use 10000 random samples.

In Figure 4 we plot the test error E as a function of ψ₁/ψ₂ = N/n when we have n = 4000 train samples and when the number of features N ranges from 1 to 14250. Training is done with λ = 10^-7. We do so in three different settings: (1) the AF is a fixed linear function; (2) the AF is a fixed ReLU; (3) the AF is a numerically optimized linear function. The latter AF is optimized as follows: for each value of ψ₁/ψ₂, we run a Bayesian optimization subroutine that minimizes the test error across all possible linear AFs.
We note that, despite the fact that with a linear AF our model is linear, optimizing the test error via linear AFs is different from optimizing the weights in the second layer during training. In this third setting, for each value of ψ₁/ψ₂, we are working with a different AF, which is why we use the set notation {•} in Figure 4.

We observe a double descent curve phenomenon also for MNIST. This was previously known (Belkin et al., 2019a), as it was also previously known that double descent curves appear for more complex data sets and neural architectures, e.g. Nakkiran et al. (2021). Unlike in the RFR theory, the interpolation threshold is not at ψ₁/ψ₂ = 1, but is around ψ₁/ψ₂ = 2. In this practical setting, and consistently with what we stated in our main observations for our theoretical setting, using different AFs affects the double descent curve phenomenon. In particular, using linear optimal AFs (one for each value of ψ₁/ψ₂) can beat using a single ReLU function and seems to destroy the double descent curve phenomenon.

The code to produce Figure 4 is in the following Github link: https://github.com/Jeffwang87/RFR_AF. This code is also available in the supplementary zip file provided. To generate the plot, run the file named RunMeToGenerateFigure_4.m. It runs using Matlab 2020b. We ran it using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of 2667 MHz DDR4 memory; on this machine it takes about 10 hours to run.
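For concreteness, the pipeline above can be sketched in a few lines. The code below is our own minimal reimplementation on synthetic data (MNIST loading omitted), not the paper's released Matlab code: features are drawn uniformly on the sphere of radius √d, the second-layer weights are fit by ridge regression, and a crude grid search stands in for the Bayesian optimization subroutine. All function names and the toy ground-truth function are ours.

```python
import numpy as np

def sample_features(N, d, rng):
    # rows of Theta i.i.d. uniform on the (d-1)-sphere of radius sqrt(d)
    Z = rng.standard_normal((N, d))
    return np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def rfr_test_error(Xtr, ytr, Xte, yte, N, af, lam, rng):
    # fit second-layer weights by ridge regression on the random-feature map
    d = Xtr.shape[1]
    Theta = sample_features(N, d, rng)
    Ztr = af(Xtr @ Theta.T / np.sqrt(d))
    Zte = af(Xte @ Theta.T / np.sqrt(d))
    a = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(N), Ztr.T @ ytr)
    return float(np.mean((Zte @ a - yte) ** 2))

rng = np.random.default_rng(0)
d, n, N = 20, 200, 100
Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((500, d))
ytr = Xtr[:, 0] + 0.1 * rng.standard_normal(n)   # toy ground truth: first coordinate
yte = Xte[:, 0]

err_relu = rfr_test_error(Xtr, ytr, Xte, yte, N,
                          lambda t: np.maximum(t, 0.0), 1e-7, rng)

# crude stand-in for Bayesian optimization: grid search over linear AFs t -> c0 + c1*t
err_best = min(
    rfr_test_error(Xtr, ytr, Xte, yte, N,
                   lambda t, c0=c0, c1=c1: c0 + c1 * t, 1e-7, rng)
    for c0 in np.linspace(-1.0, 1.0, 5) for c1 in np.linspace(0.25, 2.0, 8)
)
```

In the actual experiment, this inner search over AF parameters is repeated for each value of ψ₁/ψ₂, producing one optimized AF per point on the curve.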

INITIALIZATION

Our results assume that the features in the RFR model are sampled i.i.d. uniformly on the ( -1)-dimensional sphere of radius √ . In this section, we numerically examine whether two other initializations of Θ lead to similar, or different, asymptotic mean squared test errors. Specifically, we initialize Θ with either Xavier initialization (Glorot & Bengio, 2010) or Kaiming initialization (He et al., 2015), and compare the resulting error curve (when = 0) with the curve for the original initialization in the three different regimes in our paper. If the new error curves agree with the ones for the original initialization in some regime, we take that as evidence that our conclusions might hold for these initializations in that regime as well.

In Figure 6-(A) and (D), we see that in regime 1, when = 0, there is agreement between the three initializations. However, this is not the case for regimes 2 (plots (B) and (E)) or 3 (plots (C) and (F)). In regime 3, it is unclear whether Xavier and Kaiming initializations agree for large values of .
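The three sampling schemes being compared can be written down directly. The snippet below is our own sketch (the paper's experiments use Matlab; the function names are ours), with Xavier-uniform and Kaiming-normal drawn under the conventions commonly attributed to Glorot & Bengio (2010) and He et al. (2015):

```python
import numpy as np

def init_sphere(N, d, rng):
    # i.i.d. rows uniform on the (d-1)-sphere of radius sqrt(d), as in our theory
    Z = rng.standard_normal((N, d))
    return np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def init_xavier_uniform(N, d, rng):
    # Xavier/Glorot: Uniform(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (d + N))
    return rng.uniform(-limit, limit, size=(N, d))

def init_kaiming_normal(N, d, rng):
    # Kaiming/He: zero-mean Gaussian with variance 2 / fan_in
    return rng.standard_normal((N, d)) * np.sqrt(2.0 / d)

rng = np.random.default_rng(0)
Theta_sphere = init_sphere(500, 100, rng)
Theta_xavier = init_xavier_uniform(500, 100, rng)
Theta_kaiming = init_kaiming_normal(500, 100, rng)
```

Note that only the sphere initialization fixes the row norms exactly at √d; the other two only control the scale of the entries, which is one plausible source of the disagreement seen in regimes 2 and 3.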

