TUNING FREQUENCY BIAS IN NEURAL NETWORK TRAINING WITH NONUNIFORM DATA

Abstract

Small generalization errors of over-parameterized neural networks (NNs) can be partially explained by the frequency bias phenomenon, where gradient-based algorithms minimize the low-frequency misfit before reducing the high-frequency residuals. Using the Neural Tangent Kernel (NTK), one can provide a theoretical analysis for training where data are drawn from constant or piecewise-constant probability densities. Since most training data sets are not drawn from such distributions, we use the NTK model and a data-dependent quadrature rule to theoretically quantify the frequency bias of NN training given nonuniform data. By replacing the loss function with a selected Sobolev norm, we can amplify or dampen the intrinsic frequency bias in NN training.

1. INTRODUCTION

Neural networks (NNs) are often trained in supervised learning on a small data set, yet they are observed to provide accurate predictions on a large number of test examples that are not seen during training. A mystery is how training can achieve small generalization errors in an overparameterized NN and a so-called "double-descent" risk curve (Belkin et al., 2019). In recent years, a potential answer has emerged called "frequency bias," which is the phenomenon that in the early epochs of training, an overparameterized NN finds a low-frequency fit of the training data, while higher frequencies are learned in later epochs (Rahaman et al., 2019; Yang & Salman, 2019; Xu, 2020). In addition to generalization errors, it is often useful to understand the convergence rate of each spectral component of the data misfit in order to study the robustness of the NN under noise. Currently, frequency bias is theoretically understood via the Neural Tangent Kernel (NTK) (Jacot et al., 2018) for uniform training data (Arora et al., 2019; Cao et al., 2019; Basri et al., 2019) and for data distributed according to a piecewise-constant probability measure (Basri et al., 2020). However, most training data sets in practice are highly clustered and nonuniform. Yet frequency bias is still observed during NN training (Fridovich-Keil et al., 2021), even though a supporting theory is absent. This paper proves that frequency bias is present for nonuniform training data by using a new viewpoint based on a data-dependent quadrature rule. We use this theory to propose new loss functions for NN training that accelerate its convergence and improve its stability with respect to noise.

An NN function is a map N : R^d → R given by

N(x) = W_{N_L} σ( W_{N_L - 1} σ( ··· ( W_2 σ( W_1 x + b_1 ) + b_2 ) + ··· ) + b_{N_L - 1} ) + b_{N_L},

where W_i ∈ R^{m_i × m_{i-1}} are weights, m_0 = d, b_i ∈ R^{m_i} are biases, and N_L is the number of layers.
Here, σ is the activation function applied entry-wise to a vector, i.e., σ(a)_j = σ(a_j). In this paper, we consider ReLU NNs, which are NNs for which σ is the ReLU function given by ReLU(t) = max(t, 0). Since ReLU(αt) = αReLU(t) for any α > 0, we assume that the input of the NN is normalized so that x ∈ S^{d-1}. To introduce a continuous perspective, we assume that there is an underlying target function g : S^{d-1} → R and that the training samples x_i follow a distribution µ(x). Given training data {(x_i, y_i)}_{i=1}^n drawn from g, where x_i ∈ S^{d-1} and y_i ≈ g(x_i), our goal is to train the NN, in a way that is robust when the sampling of y_i from g(x_i) is contaminated by noise, so that N uniformly approximates g on S^{d-1}. One standard training procedure is a gradient-based optimization algorithm that minimizes the residual in the squared L²(dµ) norm, i.e.,

Φ(W) = (A_d/2) ∫_{S^{d-1}} |g(x) - N(x; W)|² dµ(x) ≈ (A_d/(2n)) Σ_{i=1}^n |y_i - N(x_i; W)|²,   (1)

where A_d is the Lebesgue measure of the hypersphere S^{d-1} and W represents the weights and bias terms. Similar to most theoretical studies investigating frequency bias, we restrict ourselves to 2-layer NNs (Arora et al., 2019; Basri et al., 2019; Su & Yang, 2019; Cao et al., 2019). To study NN training, it is common to consider the dynamics of Φ(W) as one optimizes the coefficients in W. For example, the gradient flow of the NN weights is given by dW/dt = -∂Φ/∂W. Define the residual z(x; W) = g(x) - N(x; W). Applying gradient flow with the population loss gives

dz(x; W)/dt = -A_d ∫_{S^{d-1}} K(x, x′; W) z(x′; W) dµ(x′),   (2)

where K(x, x′; W) = ⟨∂N(x; W)/∂W, ∂N(x′; W)/∂W⟩. Under the assumption that the weights do not change much during training, one can consider the NTK given the underlying time-independent distribution of W, i.e., K_∞(x, x′) = E_W[K(x, x′; W)] (Du et al., 2018). Based on eq. (2), one can understand the decay of the residual by studying the reproducing kernel Hilbert space (RKHS) through a spectral decomposition of the integral operator L defined by (Lz)(x) = ∫_{S^{d-1}} K_∞(x, x′) z(x′) dµ(x′). Most results in the literature require µ(x) to be the uniform distribution over the sphere, so that the eigenfunctions of L are spherical harmonics and the eigenvalues have explicit forms (Cao et al., 2019; Basri et al., 2019; Scetbon & Harchaoui, 2021). These explicit formulas for the eigenvalues and eigenfunctions of L rely on the Funk-Hecke theorem, which provides a formula allowing one to express an integral over a hypersphere as an integral over an interval (Seeley, 1966). The frequency bias of NN training can be explained by the fact that low-degree spherical harmonic polynomials are eigenfunctions of L associated with large eigenvalues (Basri et al., 2019). Thus, for uniform training data, the optimization of the weights and biases of an NN tends to fit the low-frequency components of the residual first. When µ(x) is nonuniform, it is difficult to analyze the spectral properties of L and thus the frequency bias properties of NN training. Since the Funk-Hecke formula no longer holds, there are only a few special cases where frequency bias is understood (Williams & Rasmussen, 2006, Sec. 4.3). Although one may derive asymptotic bounds for the eigenvalues (Widom, 1963; 1964; Bach & Jordan, 2002), it is hard to obtain formulas for the eigenfunctions, and one usually relies on numerical approximations (Baker, 1977). For the ReLU-based NTK, Basri et al. (2020) provided explicit eigenfunctions assuming that µ(x) is piecewise constant on S¹, but this analysis does not generalize to higher dimensions.
To study the frequency bias of NN training, one needs to understand both the eigenvalues and eigenfunctions of L, and this remains a significant challenge for a general µ(x) due to the absence of the Funk-Hecke formula. To overcome this challenge, we take a different point of view. While it is standard to discretize the integral in eq. (1) using a Monte Carlo-like average, we discretize it using a data-dependent quadrature rule whose nodes are the training data. That is, we investigate the frequency bias of NN training when minimizing the residual in the standard squared L² norm:

Φ(W) = (1/2) ∫_{S^{d-1}} |g(x) - N(x; W)|² dx ≈ (1/2) Σ_{i=1}^n c_i |y_i - N(x_i; W)|²,   (3)

where c_1, …, c_n are the quadrature weights associated with the (nonuniform) input data x_1, …, x_n. If x_1, …, x_n are drawn from a uniform distribution over the hypersphere, then one can select c_i = A_d/n for 1 ≤ i ≤ n; otherwise, one can choose any quadrature weights so that the integration rule is accurate (see section 4.2). If x_1, …, x_n are drawn independently at random from µ(x), then it is often reasonable to select c_i = 1/(n p(x_i)), where dµ(x) = p(x) dx. While c_1, …, c_n depend on x_1, …, x_n, the continuous expression for Φ(W) in eq. (3) is always unaltered. Therefore, we can use the Funk-Hecke formula to analyze the eigenvalues and eigenfunctions of the integral operator L̄ defined by (L̄z)(x) = ∫_{S^{d-1}} K_∞(x, x′) z(x′) dx′, revealing the frequency bias. We emphasize that by choosing eq. (3) as the loss function instead of eq. (1), we are enforcing the frequency bias of NN training, whereas eq. (1) does not ensure such a spectral property (see Section 6.1 for an illustration). To further tune the NN frequency bias during training, we also propose to minimize the residual in a squared Sobolev H^s norm for a carefully selected s ∈ R. Unlike the L² norm (the case s = 0), the H^s norm for s ≠ 0 has its own frequency bias.
For s > 0, the H^s norm penalizes high frequencies more than low ones, while for s < 0, low frequencies are penalized the most. We implement the squared H^s norm using a quadrature rule, which induces a different integral operator L_s. We analyze the eigenvalues and eigenfunctions of L_s, and consequently the frequency bias of NN training, using the Funk-Hecke formula. Given our new understanding of frequency bias, we select s so that the H^s norm amplifies, dampens, counterbalances, or reverses the natural frequency bias of overparameterized NN training.

Contributions. Here are our three main contributions to analyzing and tuning NN frequency bias. (1) From our quadrature point of view, we analyze the frequency bias in training a 2-layer overparameterized ReLU NN with nonuniform training data. In Theorem 2, we show that the theory of frequency bias in Basri et al. (2019) for uniform training data continues to hold in the nonuniform case up to quadrature errors. In Theorem 3, we provide control of the quadrature errors. (2) We use our understanding of frequency bias to modify the usual squared L² loss function to a squared H^s norm. By selecting s, we can amplify or dampen the intrinsic frequency bias in NN training, accelerate the convergence of gradient-based optimization procedures, and filter out noise of particular frequencies. (3) A potential issue with the H^s norm is the difficulty of implementing it with high-dimensional training data. Using an image dataset of dimension 28² = 784, we show how to use an encoder-decoder architecture to implement a practical version of the squared H^s loss and adjust the frequency bias in NN training to suppress noise of different frequencies (see Figure 4).

2. PRELIMINARIES AND NOTATION

For d > 1, let g : S^{d-1} → R be a square-integrable function defined on S^{d-1}. The function g can be expressed in a spherical harmonic expansion given by

g(x) = Σ_{ℓ=0}^∞ Σ_{p=1}^{N(d,ℓ)} ĝ_{ℓ,p} Y_{ℓ,p}(x),   ĝ_{ℓ,p} = ∫_{S^{d-1}} g(x) Y_{ℓ,p}(x) dx,   (4)

where Y_{ℓ,p} is the spherical harmonic basis function of degree ℓ and order p (Dai & Xu, 2013). Here, N(d, ℓ) is the number of spherical harmonic functions of degree ℓ, so that N(d, 0) = 1 and

N(d, ℓ) = (2ℓ + d - 2) Γ(ℓ + d - 2) / ( Γ(ℓ + 1) Γ(d - 1) ),   ℓ ≥ 1.

The set {Y_{ℓ,p}}_{ℓ≥0, 1≤p≤N(d,ℓ)} is an orthonormal basis for L²(S^{d-1}). Let H^d_ℓ be the span of {Y_{ℓ,p}}_{p=1}^{N(d,ℓ)}, and let Π^d_ℓ = ⊕_{j=0}^ℓ H^d_j be the space of spherical harmonics of degree ≤ ℓ.
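The dimension count N(d, ℓ) is easy to evaluate directly. The following helper (our own illustrative code, not from the paper) implements the formula and can be checked against familiar special cases.

```python
from math import gamma

def num_spherical_harmonics(d, ell):
    """N(d, ell): number of spherical harmonics of degree ell on S^{d-1}.
    N(d, 0) = 1 and, for ell >= 1,
    N(d, ell) = (2*ell + d - 2) * Gamma(ell + d - 2) / (Gamma(ell + 1) * Gamma(d - 1))."""
    if ell == 0:
        return 1
    return round((2 * ell + d - 2) * gamma(ell + d - 2)
                 / (gamma(ell + 1) * gamma(d - 1)))
```

On S¹ (d = 2) this gives 2 for every ℓ ≥ 1 (the cos/sin pair), and on S² (d = 3) it gives the familiar 2ℓ + 1.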

Given distinct training data {x_i}_{i=1}^n ⊂ S^{d-1} and evaluations y_i = g(x_i) for 1 ≤ i ≤ n, our goal is to understand the intrinsic frequency bias behavior of training a 2-layer ReLU NN given by

N(x) = (1/√m) Σ_{r=1}^m a_r ReLU(w_r^⊤ x + b_r),   ReLU(t) = max(t, 0),   (5)

where m is the number of hidden neurons, w_1, …, w_m and a_1, …, a_m are weights, and b_1, …, b_m are biases. We use the same setup as in (Basri et al., 2020): (1) w_1, …, w_m are initialized independently and identically distributed (i.i.d.) as Gaussian random variables with covariance matrix κ²I, where κ > 0; (2) the biases are initialized to 0; and (3) a_1, …, a_m are initialized i.i.d. as +1 with probability 1/2 and -1 otherwise, and {a_r} are not updated during training. We use a gradient-based optimization scheme to train the weights and biases and aim to minimize a residual defined by a symmetric positive definite (SPD) matrix P, which can be written as

Φ_P(W) = (1/2)(y - u)^⊤ P (y - u),   (6)

where y = (g(x_1), …, g(x_n))^⊤ and u = (N(x_1), …, N(x_n))^⊤. For example, P = A_d n^{-1} I in eq. (1) and P = diag(c_1, …, c_n) in eq. (3). Recall that W represents all the weights and biases of the NN. Given the loss function, we train the NN with the gradient descent algorithm:

w_r(k+1) - w_r(k) = -η ∂Φ_P/∂w_r,   b_r(k+1) - b_r(k) = -η ∂Φ_P/∂b_r,   1 ≤ r ≤ m,   (7)

where k is the iteration number and η > 0 is the learning rate. The matrix P induces an inner product ⟨ξ, ζ⟩_P = ξ^⊤ P ζ, which leads to a finite-dimensional Hilbert space with norm ∥ξ∥_P = ⟨ξ, ξ⟩_P^{1/2}. Given a matrix A ∈ R^{n×n}, we define its operator norm ∥A∥_P = sup_{ξ∈R^n, ∥ξ∥_P=1} ∥Aξ∥_P. We also define a finite positive number that depends on P:

M_P = sup_{ξ∈R^n, ∥ξ∥_2=1} ∥ξ∥_P = ( sup_{ξ∈R^n\{0}} ξ^⊤Pξ / ξ^⊤ξ )^{1/2} = sup_{ζ∈R^n, ∥ζ∥_P=1} ∥Pζ∥_2.   (8)

Note that, by eq. (8), we also have M_P = ∥P^{1/2}∥_2 = ∥P∥_2^{1/2}.
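A minimal NumPy sketch of this setup may help fix ideas: the 2-layer ReLU model of eq. (5), the initialization described above, and the diagonal-P loss Φ_P of eq. (6). Function names and the value κ = 0.1 are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(m, d, kappa=0.1):
    """Initialization from the setup above: Gaussian weights with
    covariance kappa^2 I, zero biases, fixed +/-1 coefficients a_r."""
    W = kappa * rng.standard_normal((m, d))
    b = np.zeros(m)
    a = rng.choice([-1.0, 1.0], size=m)
    return W, b, a

def nn_forward(X, W, b, a):
    """N(x) = (1/sqrt(m)) sum_r a_r ReLU(w_r^T x + b_r), applied to each row of X."""
    m = W.shape[0]
    return np.maximum(X @ W.T + b, 0.0) @ a / np.sqrt(m)

def quadrature_loss(X, y, c, W, b, a):
    """Phi_P(W) = (1/2) sum_i c_i |y_i - N(x_i)|^2, i.e. eq. (6) with P = diag(c)."""
    r = y - nn_forward(X, W, b, a)
    return 0.5 * np.sum(c * r ** 2)
```

The gradient descent updates of eq. (7) would then be applied to `W` and `b` while `a` stays fixed.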
Furthermore, we define the matrix H^∞ ∈ R^{n×n} by

H^∞_{ij} = E_{w∼N(0,κ²I)}[ ((x_i^⊤x_j + 1)/2) 1_{{w^⊤x_i ≥ 0, w^⊤x_j ≥ 0}} ] = (x_i^⊤x_j + 1)(π - arccos(x_i^⊤x_j)) / (4π).   (9)

Note that due to the introduction of the biases, H^∞ is slightly different from the one in (Du et al., 2018; Arora et al., 2019). In fact, in contrast with (Du et al., 2018), the matrix H^∞ defined in eq. (9) is SPD regardless of the distribution of the training data, as shown in the supplementary material.

Proposition 1. If x_1, …, x_n are distinct, then H^∞ in eq. (9) is SPD.

As a consequence of Proposition 1, the matrix H^∞P has positive eigenvalues, which we denote by λ_{n-1} ≥ ··· ≥ λ_0 > 0. In fact, let Λ = diag(λ_0, …, λ_{n-1}). Then H^∞P is self-adjoint in (R^n, ⟨·,·⟩_P) (see appendix A) and can be diagonalized as

H^∞P = P^{-1/2} (P^{1/2} H^∞ P^{1/2}) P^{1/2} = P^{-1/2} V^{-1} Λ V P^{1/2},   (10)

where P^{1/2}H^∞P^{1/2} = V^{-1}ΛV is SPD and therefore diagonalizable. One can view H^∞ as coming from sampling a continuous kernel K_∞ : S^{d-1} × S^{d-1} → R given by

K_∞(x, y) = K_∞(⟨x, y⟩) = (⟨x, y⟩ + 1)(π - arccos(⟨x, y⟩)) / (4π),   (11)

where ⟨·,·⟩ is the ℓ² inner product. The eigenvalues and eigenfunctions of K_∞ are known explicitly via the Funk-Hecke formula (Basri et al., 2019):

∫_{S^{d-1}} K_∞(x, y) Y_{ℓ,p}(y) dy = µ_ℓ Y_{ℓ,p}(x),   ℓ ≥ 0.   (12)

The explicit formulas for µ_ℓ with ℓ ≥ 0 are given in the supplementary material. We find that µ_ℓ > 0 for all ℓ and that µ_ℓ = O(ℓ^{-d}) for large ℓ (Basri et al., 2019; Bietti & Mairal, 2019).
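The closed form for H^∞ in eq. (9) is straightforward to evaluate, and Proposition 1 can be illustrated numerically. The helper below is our own sketch (names are ours); the `clip` guards `arccos` against floating-point rounding of the Gram matrix.

```python
import numpy as np

def h_infinity(X):
    """H^inf_ij = (x_i^T x_j + 1)(pi - arccos(x_i^T x_j)) / (4*pi), eq. (9),
    for rows of X that are unit vectors on S^{d-1}."""
    G = np.clip(X @ X.T, -1.0, 1.0)  # inner products, clipped for arccos
    return (G + 1.0) * (np.pi - np.arccos(G)) / (4.0 * np.pi)

# Distinct random points on S^2; Proposition 1 says H^inf is then SPD.
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
H = h_infinity(X)
```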

3. TRAINING CONVERGENCE WITH A GENERAL LOSS FUNCTION

Given the NN model in eq. (5) and a general loss function Φ_P in eq. (6), we are interested in the convergence rate of NN training. We study this by analyzing the convergence rate for each harmonic component. We start by presenting a convergence result that holds for any SPD matrix P. It says that, up to an error ϵ that can be made arbitrarily small by taking κ small enough and m large enough, the residual of the NN at the kth iteration is approximately (I - 2ηH^∞P)^k y.

Theorem 1. In eq. (5), suppose that w_1, …, w_m, b_1, …, b_m, and a_1, …, a_m are initialized as in section 2, and let u(k) = (N_k(x_1), …, N_k(x_n))^⊤, where N_k is the NN function after k steps of eq. (7) and N_0 is the initial NN function. Let an accuracy goal 0 < ϵ < 1, a probability of failure 0 < δ < 1, and a time span T > 0 be given. Then, there exist constants C_1, C_2 > 0 that depend only on the dimension d such that if 0 < η ≤ 1/(2M_P²n) (see eq. (8)), κ ≤ C_1 ϵ M_P^{-1} δ/n, and m satisfies

m ≥ C_2 [ M_P^6 n³ κ² ( λ_0^{-4} ϵ^{-2} + η⁴T⁴ϵ^{-4} ) + M_P^4 n² log(n/δ) ( λ_0^{-2} ϵ^{-2} + η²T²ϵ^{-2} ) ],   (13)

then with probability ≥ 1 - δ, we have

y - u(k) = (I - 2ηH^∞P)^k y + ϵ(k),   ∥ϵ(k)∥_P ≤ ϵ,   0 ≤ k ≤ T.   (14)

Here, H^∞ follows eq. (9), y = (g(x_1), …, g(x_n))^⊤, and λ_0 is the smallest eigenvalue of H^∞P.

We defer the proof to the supplementary material; it uses techniques from (Su & Yang, 2019). The main idea behind the proof is that I - 2ηH^∞P is close to the transition matrix of the residual y - u(k) when m is large. By taking κ small, we can control the size of u(0) and therefore obtain y - u(k) ≈ (I - 2ηH^∞P)^k (y - u(0)) ≈ (I - 2ηH^∞P)^k y. As η decreases, the gradient descent algorithm gets closer to the gradient flow algorithm (Du et al., 2018), which allows us to more accurately quantify the frequency bias (see section 4). For a fixed n, if y is an eigenvector associated with λ_0, then ∥(I - 2ηH^∞P)^k y∥_P = exp(-k log((1 - 2ηλ_0)^{-1})) ∥y∥_P, where log((1 - 2ηλ_0)^{-1}) is called the convergence rate (Su & Yang, 2019). If we assume λ_min(P) = O(1/n), which is the case in eq. (1), then, as shown in (Su & Yang, 2019, Thm. 2) and (Nguyen et al., 2021), we expect that λ_0 → 0 as n → ∞. Hence, as n → ∞, there exists a labeling y of the training data, depending on n, that makes the convergence rate vanish. However, as suggested by (Su & Yang, 2019; Cao et al., 2019), for a fixed bandlimited target function g, the convergence rate in early epochs stays constant as n → ∞. We make similar observations in section 4.
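Theorem 1's prediction (I - 2ηH^∞P)^k y can be evaluated exactly through the similar symmetric matrix P^{1/2}H^∞P^{1/2} from eq. (10). The sketch below (our own, exercised on an arbitrary SPD stand-in for the kernel matrix) confirms that the P-norm of the predicted residual decays monotonically when η is small enough.

```python
import numpy as np

def predicted_residual_norms(H, P_diag, y, eta, num_steps):
    """||(I - 2*eta*H*P)^k y||_P for k = 0..num_steps, with P = diag(P_diag).
    Works in the coordinates of the similar SPD matrix P^{1/2} H P^{1/2}."""
    s = np.sqrt(P_diag)
    lam, V = np.linalg.eigh(H * s[None, :] * s[:, None])  # eq. (10) spectrum
    z = V.T @ (s * y)                                     # coords of P^{1/2} y
    ks = np.arange(num_steps + 1)[:, None]
    decay = (1.0 - 2.0 * eta * lam)[None, :] ** ks
    return np.linalg.norm(decay * z[None, :], axis=1)

rng = np.random.default_rng(4)
A = rng.standard_normal((12, 12))
H = A @ A.T / 12 + 0.1 * np.eye(12)   # stand-in SPD kernel matrix, not eq. (9)
P_diag = np.full(12, 0.1)             # quadrature weights c_i
y = rng.standard_normal(12)
norms = predicted_residual_norms(H, P_diag, y, eta=0.2, num_steps=200)
```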

4. FREQUENCY BIAS WITH AN L 2 -BASED LOSS FUNCTION

The mean-squared loss function in eq. (1) corresponds to setting c_i = A_d/n for 1 ≤ i ≤ n in eq. (3). When µ is uniform, eq. (1) and eq. (3) are equivalent; when µ is nonuniform, we introduce a quadrature rule with nodes x_1, …, x_n and weights c_1, …, c_n to approximate the L² loss function in eq. (3). The weights are selected so that, for low-frequency functions f : S^{d-1} → R, the quadrature error

E_c(f) = ∫_{S^{d-1}} f(x) dx - Σ_{i=1}^n c_i f(x_i)   (15)

is relatively small. A reasonable quadrature rule has positive weights for numerical stability and satisfies Σ_{i=1}^n c_i = A_d so that it exactly integrates constants. The continuous squared L² loss function based on the Lebesgue measure is then discretized to the square of a weighted discrete ℓ² norm (see eq. (3)). Hence, we take P = D_c = diag(c_1, …, c_n), which is SPD as the c_i's are positive. For a vector v ∈ R^n, we write ∥v∥²_c = v^⊤ D_c v and set c_max = max_{1≤i≤n} c_i. We now apply Theorem 1 to study the frequency bias of NN training with the squared L² loss in eq. (3). We state these results in terms of quadrature errors. Recall our continuous setup, where we assume that the training data are samples of a function g : S^{d-1} → R so that y_i = g(x_i) for 1 ≤ i ≤ n. We further assume that g is bandlimited with bandlimit L, i.e., g = g_0 + ··· + g_L with g_ℓ ∈ H^d_ℓ for 0 ≤ ℓ ≤ L. For 1 ≤ i ≤ n, j, ℓ ≥ 0, and 1 ≤ p ≤ N(d, ℓ), we define the quadrature errors

e^a_{j,ℓ,p} = E_c(g_j Y_{ℓ,p}),   e^b_{i,ℓ,p} = E_c(K_∞(x_i, ·) Y_{ℓ,p}),   e^c_{j,ℓ} = E_c(g_j g_ℓ),   e^d_{i,ℓ} = E_c(K_∞(x_i, ·) g_ℓ).   (16)

We interpret g_ℓ = 0 when ℓ > L, and products of functions are interpreted pointwise.
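As a quick illustration of the choice c_i = 1/(n p(x_i)), the sketch below uses a density of our own choosing on S¹ (parameterized by the angle θ) and checks that the resulting rule integrates a smooth function and the constant 1 accurately; the density, sampler, and names are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_angles(n):
    """Rejection-sample theta from the illustrative density
    p(theta) = (1 + 0.5*sin(theta)) / (2*pi) on [0, 2*pi)."""
    out = np.empty(0)
    while out.size < n:
        t = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
        u = rng.uniform(0.0, 1.5, 2 * n)          # envelope height 1.5
        out = np.concatenate([out, t[u < 1.0 + 0.5 * np.sin(t)]])
    return out[:n]

def importance_weights(theta):
    """c_i = 1/(n * p(x_i)) for i.i.d. draws from mu, as suggested above."""
    p = (1.0 + 0.5 * np.sin(theta)) / (2.0 * np.pi)
    return 1.0 / (theta.size * p)

theta = sample_angles(20000)
c = importance_weights(theta)
```

The weights sum to approximately A_2 = 2π, and Σ_i c_i f(θ_i) approximates ∫ f dθ with the usual O(n^{-1/2}) Monte Carlo error.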

4.1. A FREQUENCY-BASED FORMULA FOR THE TRAINING ERROR

We obtain a result similar to (Arora et al., 2019, Thm. 4.1) when using the loss function Φ in eq. (3). Instead of expressing the training error using the spectrum of H^∞D_c, we directly relate the training error to the frequency components of g and the eigenvalues of the continuous kernel K_∞.

Theorem 2. Under the same setup and assumptions as Theorem 1, let P = D_c and M_P = √c_max. If g : S^{d-1} → R is a bandlimited function with bandlimit L and 1 - 2ηµ_ℓ > 0 for all 0 ≤ ℓ ≤ L (see eq. (12)), then with probability ≥ 1 - δ we have

∥y - u(k)∥²_c = Σ_{ℓ=0}^L (1 - 2ηµ_ℓ)^{2k} ∥g_ℓ∥²_{L²} + ε_1(k) + ε_2 + ε_3(k),   |ε_3(k)| ≤ ϵ,   0 ≤ k ≤ T,   (17)

where ε_1(k) and ε_2 satisfy

|ε_1(k)| ≤ Σ_{j=0}^L Σ_{ℓ=0}^L (1 - 2ηµ_j)^k (1 - 2ηµ_ℓ)^k |e^c_{j,ℓ}|,   |ε_2| ≤ Σ_{ℓ=0}^L (√A_d / µ_ℓ) max_{1≤i≤n} |e^d_{i,ℓ}|.

The proof of Theorem 2 is postponed to the supplementary material. The idea is that, by Theorem 1, we know y - u(k) = (I - 2ηH^∞D_c)^k y + ε_3(k). Using the Funk-Hecke formula and the quadrature rule, the numbers 1 - 2ηµ_ℓ are roughly eigenvalues of I - 2ηH^∞D_c with associated eigenvectors y_ℓ = (g_ℓ(x_1), …, g_ℓ(x_n))^⊤. Hence, y - u(k) ≈ Σ_{ℓ=0}^L (1 - 2ηµ_ℓ)^k y_ℓ. This can be made precise by introducing ε_2. Finally, up to the quadrature error ε_1, we have ⟨g_j, g_ℓ⟩_{L²} ≈ A_d n^{-1} y_j^⊤ y_ℓ, which gives us eq. (17). Since Σ_{i=1}^n c_i = A_d, for a fixed data distribution µ we expect that c_max = O(n^{-1}) as n → ∞, so that η does not need to decay as n → ∞. Up to a quadrature error, ∥y - u(k)∥_c is close to the L² norm of the residual function g - N_k. Explicit formulas for the eigenvalues {µ_ℓ} (see eq. (12)) are given in (Basri et al., 2019), and it was shown that µ_ℓ = O(ℓ^{-d}) (Bietti & Mairal, 2019). Theorem 2 demonstrates the frequency bias of NN training, as the rate of convergence for frequency 0 ≤ ℓ ≤ L is 1 - 2ηµ_ℓ, which is close to 1 when ℓ is large. As η → 0, we have (1 - 2ηµ_ℓ)^{2t/η} → e^{-4µ_ℓ t}, which gives the convergence rate for frequency ℓ under gradient flow.
Therefore, we expect that NN training approximates the low-frequency content of g faster than its high-frequency one, which is similar to the case of training with uniform data (Basri et al., 2019) .
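The gradient-flow limit quoted above is easy to sanity-check numerically. In this sketch (helper names are ours), the discrete per-frequency decay factor approaches e^{-4µ_ℓ t} as η → 0, and components with larger µ_ℓ (lower frequencies) decay faster, which is exactly the frequency bias of Theorem 2.

```python
import math

def discrete_decay(mu, eta, t):
    """Squared-residual decay factor after k = t/eta gradient-descent steps:
    (1 - 2*eta*mu)^(2k), as in eq. (17)."""
    k = round(t / eta)
    return (1.0 - 2.0 * eta * mu) ** (2 * k)

def flow_decay(mu, t):
    """Gradient-flow limit exp(-4*mu*t)."""
    return math.exp(-4.0 * mu * t)
```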

4.2. ESTIMATING THE QUADRATURE ERRORS

We now quantify the quadrature errors in Theorem 2. Suppose we can design a quadrature rule at the training data x_1, …, x_n whose error satisfies

|E_c(h)| = | ∫_{S^{d-1}} h(x) dx - Σ_{i=1}^n c_i h(x_i) | ≤ γ_{n,ℓ} ∥h∥_{L^∞},   h ∈ Π^d_ℓ,   ℓ ≥ 0,   (18)

for some constants γ_{n,ℓ} ≥ 0; then we can bound the terms in eq. (16). We expect that, for each fixed ℓ, γ_{n,ℓ} → 0 as n → ∞, as this says that integrals can be calculated more accurately on a denser set of quadrature nodes. In practice, it can require a large amount of training data to make γ_{n,ℓ} small when d is large. Under the assumption that our quadrature rule satisfies eq. (18) with reasonably small γ_{n,ℓ}'s when ℓ is small, we can bound the quadrature errors appearing in Theorem 2.

Theorem 3. Under the same assumptions as Theorem 2, and assuming the quadrature rule satisfies eq. (18), there exist constants C_1, C_2 > 0 depending only on the dimension d such that the terms |ε_1(k)| and |ε_2| in Theorem 2 satisfy

|ε_1(k)| ≤ C_1 ( L³ℓ^{-1} + L²γ_{n,ℓ} ) max_{0≤j≤L} ∥g_j∥²_{L^∞},   |ε_2| ≤ C_2 ( L²ℓ^{-1} + Lγ_{n,ℓ} ) max_{0≤j≤L} ∥g_j∥_{L^∞} Σ_{j=0}^L µ_j^{-1},

for all k ≥ 0 and ℓ ≥ 1, where g = g_0 + ··· + g_L with g_j ∈ H^d_j and γ_{n,ℓ} satisfies eq. (18).

The proof is in the supplementary material. Theorem 3 states that ε_1(k) and ε_2 can be made arbitrarily small if the quadrature errors converge to 0 as the number of nodes n → ∞. In particular, if there is a sequence {ℓ_n} increasing to ∞ such that the quadrature rule is exact for all functions h ∈ Π^d_{ℓ_n}, i.e., E_c(h) = 0 (see, e.g., (Mhaskar et al., 2000)), then the rates of convergence of ε_1(k) and ε_2 are both O(1/ℓ_n) for a fixed g. Even without exactness, we still obtain convergence provided the quadrature errors are small, as the following corollary shows.

Corollary 1. Suppose there exists a sequence ℓ_n → ∞ such that γ_{n,ℓ_n} → 0 as n → ∞. Then, for a fixed L ≥ 0, we have max_{k≥0, g∈Π^d_L} |ε_1(k)| / ∥g∥²_{L²} → 0 and max_{g∈Π^d_L} |ε_2| / ∥g∥_{L²} → 0 as n → ∞.

Corollary 1 shows that as n increases, the quadrature errors ε_1(k) and ε_2 converge to zero. Moreover, this convergence is uniform in the sense that it does not depend on the specific choice of g ∈ Π^d_L. Here, we normalize ε_1(k) and ε_2 by ∥g∥²_{L²} and ∥g∥_{L²}, respectively, to obtain "relative" quadrature errors that do not scale when g is multiplied by a scalar (see eq. (17)).
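On S¹, weights satisfying the exactness condition E_c(h) = 0 for all trigonometric polynomials of degree ≤ L can be obtained by solving a small linear system. The paper computes positive weights with a constrained quadratic program (see appendix H); the minimum-norm least-squares sketch below is a simpler stand-in of our own and does not guarantee positive weights.

```python
import numpy as np

def trig_exact_weights(theta, L):
    """Minimum-norm weights c with sum_i c_i h(theta_i) = integral of h
    over [0, 2*pi) for every trigonometric polynomial h of degree <= L,
    i.e. E_c(h) = 0. Rows of V evaluate 1, cos(l t), sin(l t); only the
    constant has a nonzero exact moment (2*pi). Requires n >= 2L+1."""
    rows = [np.ones_like(theta)]
    for l in range(1, L + 1):
        rows.append(np.cos(l * theta))
        rows.append(np.sin(l * theta))
    V = np.stack(rows)                               # shape (2L+1, n)
    moments = np.zeros(2 * L + 1)
    moments[0] = 2.0 * np.pi
    c, *_ = np.linalg.lstsq(V, moments, rcond=None)  # min-norm exact solution
    return c

rng = np.random.default_rng(7)
theta = 2.0 * np.pi * rng.beta(2.0, 5.0, 300)        # clustered, nonuniform nodes
c = trig_exact_weights(theta, L=5)
```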

5. FREQUENCY BIAS WITH A SOBOLEV-NORM LOSS FUNCTION

The frequency bias in the training of an overparameterized NN has several consequences. In many situations, slower convergence for the high-frequency components of a function is beneficial, since the NN training procedure is then less sensitive to oscillatory noise in the data, acting as a low-pass filter. This significantly improves the generalization error of overparameterized NNs. However, in other situations, NN training struggles to accurately learn the high-frequency content of g, resulting in slow convergence. To precisely control the frequency bias of NN training, we propose to train an NN with a loss function that has an intrinsic spectral bias. Let D′(S^{d-1}) be the space of distributions on S^{d-1}. Given s ∈ R, consider L_s : D′(S^{d-1}) → D′(S^{d-1}), where L_s = (I + (-∆)^{1/2})^s and ∆ is the Laplace-Beltrami operator on the sphere. We follow (Barceló et al., 2021) and define the spherical Sobolev space H^s(S^{d-1}) = {f ∈ D′(S^{d-1}) : L_s f ∈ L²(S^{d-1})}, equipped with a norm equivalent to eq. (1.24) in (Barceló et al., 2021),

∥f∥²_{H^s(S^{d-1})} = Σ_{ℓ=0}^∞ Σ_{p=1}^{N(d,ℓ)} (1 + ℓ)^{2s} |f̂_{ℓ,p}|²,   (19)

where the f̂_{ℓ,p} are the spherical harmonic coefficients of f (see eq. (4)) and N(d, ℓ) is given in section 2. We propose to use the loss function (1/2)∥g - N∥²_{H^s} in place of (1/2)∥g - N∥²_{L²} in eq. (3). When s = 0, it reduces to the L² norm. If s > 0, the high-frequency spherical harmonic coefficients are amplified by (1 + ℓ)^{2s}. The high-frequency components of the residual are then penalized more in the loss function, and one can expect NN training to learn the high-frequency components faster with the squared H^s loss function than with eq. (3). Similarly, if s < 0, the high-frequency spherical harmonic coefficients are dampened by (1 + ℓ)^{2s}, and one expects NN training to capture the high-frequency components of the residual more slowly with the squared H^s loss function. However, when s < 0, the training is more robust to high-frequency noise in the data. By tuning the parameter s, we can control the frequency bias in NN training (see Theorem 4). The choice of s for a particular application can be determined from theory or by cross-validation.

First, we justify that the residual function is indeed in H^s. Since we assume that g is bandlimited, g ∈ H^s for all s ∈ R. Proposition 2 shows that we can consider s < 3/2 for ReLU-based NNs.

Proposition 2. Suppose N : S^{d-1} → R is a 2-layer ReLU NN (see eq. (5)). Then N ∈ H^s(S^{d-1}) for all s < 3/2. Moreover, if s ≥ 3/2, then N ∈ H^s(S^{d-1}) if and only if N is affine.

The proof is deferred to the supplementary material. When s ≥ 3/2, the residual function N - g may not be in H^s. However, we can still truncate the sum in eq. (19) at a maximum frequency ℓ_max to train the NN, although the sum can then no longer be interpreted as an approximation of a Sobolev norm at the continuous level. We discretize the Sobolev-based loss function as

Φ_s(W) = (1/2) Σ_{ℓ=0}^{ℓ_max} Σ_{p=1}^{N(d,ℓ)} (1 + ℓ)^{2s} ( Σ_{i=1}^n c_i Y_{ℓ,p}(x_i)(g - N)(x_i) )² = (1/2)(y - u)^⊤ P_s (y - u),   (20)

where u and y follow eq. (6), P_s = Σ_{ℓ=0}^{ℓ_max} Σ_{p=1}^{N(d,ℓ)} (1 + ℓ)^{2s} P_{ℓ,p}, P_{ℓ,p} = a_{ℓ,p} a_{ℓ,p}^⊤, and (a_{ℓ,p})_i = c_i Y_{ℓ,p}(x_i). We assume that P_s is SPD, which requires that (ℓ_max + 1)² ≥ n. Next, we present our convergence theorem for Sobolev training.

Theorem 4. Suppose g ∈ Π^d_L and Φ_s is the loss function in eq. (20), where P_s is SPD and ℓ_max ≥ L. Under the assumptions of Theorem 1, if 1 - 2ηµ_ℓ(1 + ℓ)^{2s} > 0 for all 0 ≤ ℓ ≤ L, then with probability ≥ 1 - δ over the random initialization, we have

y - u(k) = Σ_{ℓ=0}^L ( 1 - 2ηµ_ℓ(1 + ℓ)^{2s} )^k y_ℓ + ε_1 + ε_2(k),   ∥ε_2(k)∥_{P_s} ≤ ϵ,   0 ≤ k ≤ T,   (21)

where y_ℓ = (g_ℓ(x_1), …, g_ℓ(x_n))^⊤ and ε_1 satisfies

∥ε_1∥_{P_s} ≤ Σ_{ℓ=0}^L µ_ℓ^{-1} ∥ε^ℓ_1∥_{P_s},   (ε^ℓ_1)_i = e^d_{i,ℓ} + Σ_{j=0}^{ℓ_max} ((1+j)^{2s}/(1+ℓ)^{2s}) Σ_{p=1}^{N(d,j)} e^a_{ℓ,j,p} ( µ_j Y_{j,p}(x_i) + e^b_{i,j,p} ).
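A direct way to see the structure of eq. (20) is to assemble P_s on S¹ with the real Fourier basis. The sketch below is our own, using uniform nodes and weights for simplicity, and verifies that P_s is SPD when 2ℓ_max + 1 ≥ n, the S¹ analogue of the rank condition above.

```python
import numpy as np

def sobolev_matrix(theta, c, s, ell_max):
    """P_s = sum_{l,p} (1+l)^{2s} a_{l,p} a_{l,p}^T with (a_{l,p})_i = c_i Y_{l,p}(x_i),
    eq. (20), using the orthonormal basis on S^1:
    Y_0 = 1/sqrt(2*pi), Y_{l,1} = cos(l t)/sqrt(pi), Y_{l,2} = sin(l t)/sqrt(pi)."""
    n = theta.size
    basis = [(0, np.full(n, 1.0 / np.sqrt(2.0 * np.pi)))]
    for l in range(1, ell_max + 1):
        basis.append((l, np.cos(l * theta) / np.sqrt(np.pi)))
        basis.append((l, np.sin(l * theta) / np.sqrt(np.pi)))
    P = np.zeros((n, n))
    for l, Y in basis:
        a = c * Y                                  # (a_{l,p})_i = c_i Y_{l,p}(x_i)
        P += (1.0 + l) ** (2.0 * s) * np.outer(a, a)
    return P

n = 9
theta = 2.0 * np.pi * np.arange(n) / n             # equispaced nodes on S^1
c = np.full(n, 2.0 * np.pi / n)                    # uniform quadrature weights
P_s = sobolev_matrix(theta, c, s=1.0, ell_max=5)   # 2*5 + 1 = 11 >= n = 9
```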
Compared to Theorem 2, Theorem 4 says that, up to the level of quadrature errors, the convergence rate of the degree-ℓ component is 1 - 2ηµ_ℓ(1 + ℓ)^{2s}. In particular, since µ_ℓ = O(ℓ^{-d}), there is an s* > 0, depending on d, such that (1 + ℓ)^{2s*}µ_ℓ can be bounded from above and below for all ℓ ≥ 0 by positive constants independent of ℓ. This means that for any s > s*, we expect to reverse the frequency bias of NN training. Figure 2 shows the reversal of frequency bias as s increases from -1 to 4 (see section 6.1).

Published as a conference paper at ICLR 2023

[Figure 1: Left: the frequency losses for ℓ = 1 (blue), 5 (red), and 9 (yellow) against the number of iterations for the loss functions in eq. (1) (solid lines) and eq. (3) (dashed lines). Right: the number of iterations for the NN training to achieve a fixed loss threshold in learning g_ℓ(x) = sin(ℓθ) for 3 ≤ ℓ ≤ 10 given the loss function in eq. (3). The black line represents the O(ℓ²) rate based on the analysis in (Basri et al., 2019).]

6. EXPERIMENTS AND DISCUSSION

This section presents three experiments with synthetic and real-world datasets to investigate the frequency bias of NN training using squared L 2 loss and squared H s loss. The first two experiments learn functions on S 1 and S 2 , respectively. In the third test, we train an autoencoder on the MNIST dataset for a denoising task. One can find more details in the supplementary material.

6.1. LEARNING TRIGONOMETRIC POLYNOMIALS ON THE UNIT CIRCLE

First, we consider learning a function on S¹. We create a set of n = 1140 nonuniform data points {x_i}_{i=1}^n, as seen in Figure 1, and compute the quadrature weights {c_i}_{i=1}^n for the loss function in eq. (3) by solving a constrained quadratic program (see appendix F.1). We train a 2-layer ReLU NN to learn g(x) = g(θ) = Σ_{ℓ=1}^9 sin(ℓθ), where x = (cos θ, sin θ). We define the frequency loss |N̂(ℓ) - ĝ(ℓ)|, where N̂ and ĝ are the Fourier coefficients of N and g, respectively (see appendix F.1). In Figure 1, we plot the frequency loss for ℓ = 1, 5, 9 in different colors to illustrate how well the NN fits each frequency component. The solid and dashed lines correspond to the loss functions in eq. (1) and eq. (3), respectively. Our observations corroborate the theoretical statements in Theorem 2. Figure 1 also shows that it takes asymptotically O(ℓ²) iterations to learn the ℓth frequency sin(ℓθ) given the loss function in eq. (3). A similar plot appears in Basri et al. (2019) for uniform training data. We also use the squared H^s norm as the loss function to learn g. After 5000 epochs, we plot the ℓth frequency loss, with ℓ ranging from 1 (blue) to 9 (red), in Figure 2 for different values of s. As s increases, the higher-frequency components are learned faster. When s > 2, the frequency bias is reversed in the sense that higher-frequency components are learned faster than lower-frequency ones, rather than the low-frequency bias under the squared L² loss (see Theorem 2). The gradually changing "rainbow" in Figure 2 shows that the smoothing property of an overparameterized NN can be compensated by the H^s loss function for large enough s, corroborating Theorem 4.

6.2. LEARNING SPHERICAL HARMONICS ON THE UNIT SPHERE

Similar to the previous example on S¹, we design an experiment on S². We utilize a data set {x_i}_{i=1}^{2500} from (Wright & Michaels, 2015), which comes with carefully designed positive quadrature weights {c_i}_{i=1}^{2500}.
We test the squared H^s loss function in NN training with the target function g(x) = Σ_{ℓ=0, ℓ even}^{30} Y_{ℓ,0}, defined on S², which involves many high-frequency components. The results are shown in Figure 3 for different values of s. The natural low-frequency bias of the NN in the case of L²-based training (i.e., s = 0) is enhanced when s = -1 and is completely reversed when s = 2.5.
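The per-frequency loss |N̂(ℓ) - ĝ(ℓ)| used in section 6.1 can be estimated by sampling both functions on a uniform grid of S¹ and taking an FFT. A minimal sketch of this diagnostic (our own helper, not the paper's implementation):

```python
import numpy as np

def frequency_loss(f_target, f_model, ell, grid_size=1024):
    """|hat{N}(ell) - hat{g}(ell)|: magnitude of the degree-ell Fourier
    coefficient of the residual, estimated on a uniform grid of [0, 2*pi)."""
    t = np.linspace(0.0, 2.0 * np.pi, grid_size, endpoint=False)
    resid = f_target(t) - f_model(t)
    return np.abs(np.fft.rfft(resid)[ell]) / grid_size

g = lambda t: np.sin(5 * t) + 0.5 * np.cos(2 * t)
model = lambda t: 0.5 * np.cos(2 * t)   # a fit that captures only the low frequency
```

Here the residual is sin(5θ), so the loss is concentrated at ℓ = 5 and vanishes at ℓ = 2.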

6.3. AUTOENCODER ON THE MNIST DATASET

The idea of Sobolev training is also useful for high-dimensional training data. In Figure 4, we present the results of an autoencoder for image denoising on the MNIST dataset (LeCun et al., 2010). The outputs of the autoencoder are shown when it is trained with the squared H^s norm as the loss function. We contaminate the dataset with random low-frequency noise (top row) and high-frequency noise (bottom row). When high-frequency noise is present, the H^s loss function generally performs better with s < 0, while s > 0 helps image deblurring when the input image suffers from low-frequency noise. This corroborates our discussion in Section 5. In appendix G, we theoretically justify this phenomenon by studying the frequency bias in operator learning.

7. CONCLUSIONS

A frequency bias phenomenon is observed in NN training with nonuniform training data. Instead of the standard mean-squared loss function in eq. (1), we propose the use of a different loss function, eq. (3), which involves quadrature weights and has a natural continuous analog. With eq. (3), we rigorously analyze frequency bias with nonuniform training data using the Funk-Hecke formula. By changing the loss function to a squared Sobolev norm, we can further control the frequency bias in NN training, which can accelerate the convergence of NN training and improve its robustness under noise.

SUPPLEMENTARY MATERIAL

This is the supplementary material for the paper titled "Tuning Frequency Bias in Neural Network Training with Nonuniform Data." The supplementary material is organized as follows. In Appendix A, we recall the notations used in the paper and introduce additional concepts for our analysis. In Appendix B, we prove that H ∞ is a symmetric and positive definite matrix. In Appendix C, we prove Theorem 1 of the paper, while the proofs of results in Section 4 and Section 5 are given in Appendix D and Appendix E, respectively. In Appendix F, we provide further details of our three experiments. The observation made in the third experiment is then briefly justified in Appendix G, in which we study frequency bias in a general operator learning setting. In Appendix H, we formally discuss the computation of positive quadrature weights.

A PRELIMINARIES AND NOTATION

For $d > 1$, let $g : S^{d-1} \to \mathbb{R}$ be a square-integrable function defined on $S^{d-1}$. The function $g$ has a spherical harmonic expansion given in Section 2. We denote the space of harmonic functions of degree $\ell$ by $\mathcal{H}_\ell^d$, which is the span of $\{Y_{\ell,p}\}_{p=1}^{N(d,\ell)}$. We further denote the space of spherical harmonics of degree $\le \ell$ by $\Pi_\ell^d = \bigoplus_{j=0}^{\ell} \mathcal{H}_j^d$. Given distinct training data $\{x_i\}_{i=1}^{n}$ from $S^{d-1}$ and evaluations $y_i = g(x_i)$ for $1 \le i \le n$, our goal is to understand the intrinsic frequency bias behavior of training the 2-layer ReLU NN given in eq. (5). It is important for the theory that we initialize the weights as independently and identically distributed (iid) Gaussian random variables with covariance matrix $\kappa^2 I$, that the bias terms are initialized to zero, and that the coefficients $a_1, \ldots, a_m$ are initialized iid as $+1$ with probability $1/2$ and $-1$ otherwise. During the training process, the values of $\{a_r\}$ are not updated. We train with the loss function given in eq. (6), so that the gradient descent algorithm for NN training is given by eq. (7). An important object in understanding the frequency bias of NN training is the symmetric and positive definite matrix $H^\infty \in \mathbb{R}^{n \times n}$ in eq. (9). Since $H^\infty$ and $P$ are symmetric positive definite matrices (see Proposition 1), $H^\infty P$ has positive real eigenvalues. To see this, note that

$$H^\infty P = H^\infty P^{1/2} P^{1/2} = P^{-1/2}\left(P^{1/2} H^\infty P^{1/2}\right) P^{1/2},$$

so $H^\infty P$ and $P^{1/2} H^\infty P^{1/2}$ are similar. Since the matrix $P^{1/2} H^\infty P^{1/2}$ is symmetric positive definite, $H^\infty P$ has positive eigenvalues. We denote the eigenvalues of $H^\infty P$ by $\lambda_{n-1} \ge \cdots \ge \lambda_0 > 0$; they partially govern the frequency bias phenomena. It is worth mentioning that although $H^\infty P$ is not symmetric, it is self-adjoint in the inner product space induced by $P$ because

$$\langle H^\infty P \xi, \zeta\rangle_P = (H^\infty P \xi)^\top P \zeta = \xi^\top P (H^\infty P \zeta) = \langle \xi, H^\infty P \zeta\rangle_P.$$

It is convenient to analyze the eigenvalues and eigenvectors of
$H^\infty P$ via the zonal kernel $K^\infty : S^{d-1} \times S^{d-1} \to \mathbb{R}$ given in eq. (11). The key is the Funk-Hecke formula (Seeley, 1966).

Theorem 5 (Funk-Hecke). Suppose $K : [-1, 1] \to \mathbb{R}$ is measurable and $K(t)(1-t^2)^{(d-3)/2}$ is integrable on $[-1, 1]$. Then, for any $h \in \mathcal{H}_\ell^d$, we have

$$\int_{S^{d-1}} K(\langle \xi, \zeta\rangle) h(\xi)\,d\xi = A_d \left(\int_{-1}^{1} K(t)\, P_{\ell,d}(t)\, (1-t^2)^{(d-3)/2}\,dt\right) h(\zeta), \qquad \zeta \in S^{d-1},$$

where $P_{\ell,d}$ is the ultraspherical polynomial given by

$$P_{\ell,d}(t) = \frac{(-1)^\ell\, \Gamma((d-1)/2)}{2^\ell\, \Gamma(\ell + (d-1)/2)\, (1-t^2)^{(d-3)/2}}\, \frac{d^\ell}{dt^\ell}\left(1 - t^2\right)^{\ell + (d-3)/2}.$$

Applying Theorem 5, we have that

$$\int_{S^{d-1}} K^\infty(x, y)\, h(y)\,d\sigma(y) = \mu_\ell\, h(x), \qquad h \in \mathcal{H}_\ell^d,$$

where $\mu_\ell > 0$ for all $\ell$ and, for $d \ge 3$ with $d$ odd, is given by (Basri et al., 2019)

$$\mu_\ell = \begin{cases} \frac{1}{2} C_1^d(0)\left[\frac{1}{(d-1)2^d}\binom{d-1}{\frac{d-1}{2}} + \frac{2^{d-2}}{d-1}\binom{d-2}{\frac{d-1}{2}}^{-1}\frac{1}{2}\sum_{p=0}^{\frac{d-3}{2}}(-1)^p\binom{\frac{d-3}{2}}{p}\frac{1}{2p+1}\right], & \ell = 0,\\[1ex] \frac{1}{2} C_1^d(1)\sum_{p=\lceil \ell/2\rceil}^{\ell+\frac{d-3}{2}} C_2^d(p,1)\left[\frac{1}{2(2p+1)} + \frac{1}{4p}\left(1 - \frac{1}{2^{2p}}\binom{2p}{p}\right)\right], & \ell = 1,\\[1ex] \frac{1}{2} C_1^d(\ell)\sum_{p=\lceil \ell/2\rceil}^{\ell+\frac{d-3}{2}} C_2^d(p,\ell)\left[\frac{-1}{2(2p-\ell+1)} + \frac{1}{2(2p-\ell+2)}\left(1 - \frac{1}{2^{2p-\ell+2}}\binom{2p-\ell+2}{\frac{2p-\ell+2}{2}}\right)\right], & \ell \ge 2 \text{ even},\\[1ex] \frac{1}{2} C_1^d(\ell)\sum_{p=\lceil \ell/2\rceil}^{\ell+\frac{d-3}{2}} C_2^d(p,\ell)\, \frac{1}{2(2p-\ell+1)}\left(1 - \frac{1}{2^{2p-\ell+1}}\binom{2p-\ell+1}{\frac{2p-\ell+1}{2}}\right), & \ell \ge 2 \text{ odd}, \end{cases}$$

where

$$C_1^d(\ell) = \frac{(-1)^\ell\, 2\pi^{(d-1)/2}}{(d-1)\, 2^\ell\, \Gamma(\ell + (d-1)/2)}, \qquad C_2^d(p, \ell) = (-1)^p \binom{\ell + \frac{d-3}{2}}{p} \frac{(2p)!}{(2p-\ell)!}.$$

Here, the exclamation mark denotes a factorial and $\binom{\cdot}{\cdot}$ denotes a binomial coefficient. Given $x$, $w_r$, and $b_r$ in eq. (5), we write $\tilde{x} = \frac{1}{\sqrt{2}}(x, 1) \in S^d$ and $\tilde{w}_r = (w_r, b_r) \in \mathbb{R}^{d+1}$. Therefore, we have $\mathrm{ReLU}(w_r^\top x + b_r) = \sqrt{2}\,\mathrm{ReLU}(\tilde{w}_r^\top \tilde{x})$, and the NN function can be rewritten as

$$N(x) = \frac{\sqrt{2}}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \mathrm{ReLU}(\tilde{w}_r^\top \tilde{x}).$$

By replacing the expectation over the random initialization $\tilde{w}$ with $\tilde{w}(t)$, we define the instantiation of $H^\infty$ at the $k$th iteration by $H(k)$, where

$$H_{ij}(k) = \frac{1}{m}\, \tilde{x}_i^\top \tilde{x}_j \sum_{r=1}^{m} \mathbf{1}_{\{\tilde{x}_i^\top \tilde{w}_r(k) \ge 0,\ \tilde{x}_j^\top \tilde{w}_r(k) \ge 0\}},$$

where $\mathbf{1}$ is an indicator function.

B THE MATRIX $H^\infty$ IS SYMMETRIC AND POSITIVE DEFINITE

Proposition 1 states that the matrix $H^\infty$ defined by eq. (9) is symmetric and positive definite.
While the symmetry of $H^\infty$ is immediate from its closed-form expression, the fact that it is positive definite requires a more detailed analysis. The proof idea is similar to that of Theorem 3.1 of (Du et al., 2018), in which the matrix $H^\infty$ is associated with a 2-layer ReLU NN without biases. However, our $H^\infty$ is associated with a 2-layer ReLU NN with biases. While (Du et al., 2018) requires that no two training data points are parallel, we allow the existence of $x_{i_1} = -x_{i_2}$ for some $i_1$ and $i_2$. To deal with this case, our proof treats such a pair of nodes $x_{i_1}, x_{i_2}$ together.

Proof of Proposition 1. For a measurable function $f : \mathbb{R}^d \to \mathbb{R}^{d+1}$, we define a norm of $f$ as $\|f\|_{\mathcal{H}}^2 = \mathbb{E}_{w \sim \mathcal{N}(0, \kappa^2 I)} \|f(w)\|_2^2$, and let $\mathcal{H}$ be the space of measurable functions such that $\|f\|_{\mathcal{H}} < \infty$. It can be shown that $\mathcal{H}$ is a Hilbert space with respect to the inner product $\langle f, g\rangle_{\mathcal{H}} = \mathbb{E}_{w \sim \mathcal{N}(0, \kappa^2 I)}\left[f(w)^\top g(w)\right]$ (Du et al., 2018). For each $x_i$, $1 \le i \le n$, we define the function $\phi_i$ by $\phi_i(w) = \tilde{x}_i \mathbf{1}_{\{w^\top x_i \ge 0\}}$, $w \in \mathbb{R}^d$. Then, $\phi_i \in \mathcal{H}$ for all $i$, and $H^\infty_{ij} = \langle \phi_i, \phi_j\rangle_{\mathcal{H}}$. We prove that $H^\infty$ is positive definite by showing that $\{\phi_i\}_{i=1}^{n}$ is a linearly independent set in $\mathcal{H}$. To do so, we show that

$$\alpha_1 \phi_1(w) + \cdots + \alpha_n \phi_n(w) = 0 \quad \text{for almost every } w \in \mathbb{R}^d \qquad (26)$$

implies that $\alpha_i = 0$ for $1 \le i \le n$. We fix some $1 \le i_1 \le n$ and, without loss of generality, assume that $x_{i_2} = -x_{i_1}$. Define the set $D_j = \{w \in \mathbb{R}^d \mid w^\top x_j = 0\}$ for $1 \le j \le n$; as a result, $D_{i_1} = D_{i_2}$. Since each $D_j$ is a hyperplane passing through the origin and $D_{i_1} \ne D_j$ for any $j \ne i_1, i_2$, there exists $z \in D_{i_1}$ such that $z \notin D_j$ for any $j \ne i_1, i_2$. For a radius $R > 0$, let $B_R = B(z, R)$ be the ball centered at $z$ of radius $R$.
Define a partition of $B_R$ into two sets $B_R^+$ and $B_R^-$ (possibly omitting a subset of $B_R$ of zero Lebesgue measure), where

$$B_R^+ = \{w \in B_R \mid w^\top x_{i_1} > 0\} = \{w \in B_R \mid w^\top x_{i_2} < 0\}, \qquad B_R^- = \{w \in B_R \mid w^\top x_{i_1} < 0\} = \{w \in B_R \mid w^\top x_{i_2} > 0\}.$$

Since each $D_j$ is closed, $B_R$ is eventually disjoint from $D_j$, $j \ne i_1, i_2$, as $R \to 0$. Hence, we have

$$\lim_{R \to 0}\, \sup_{w \in B_R} |\phi_j(z) - \phi_j(w)| = 0, \qquad j \ne i_1, i_2,$$

where $|\cdot|$ denotes the Euclidean distance. Then, for any $j \ne i_1, i_2$, we have

$$\lim_{R \to 0} \frac{1}{|B_R^+|} \int_{B_R^+} \phi_j(w)\,dw = \phi_j(z), \qquad \lim_{R \to 0} \frac{1}{|B_R^-|} \int_{B_R^-} \phi_j(w)\,dw = \phi_j(z),$$

where $|\cdot|$ denotes the Lebesgue measure of a set. Consequently, we find that

$$\lim_{R \to 0} \left(\frac{1}{|B_R^+|} \int_{B_R^+} \phi_j(w)\,dw - \frac{1}{|B_R^-|} \int_{B_R^-} \phi_j(w)\,dw\right) = 0, \qquad j \ne i_1, i_2.$$

Now, consider the integrals of $\phi_{i_1}$ and $\phi_{i_2}$. We have

$$\lim_{R \to 0} \left(\frac{1}{|B_R^+|} \int_{B_R^+} \phi_{i_1}(w)\,dw - \frac{1}{|B_R^-|} \int_{B_R^-} \phi_{i_1}(w)\,dw\right) = \lim_{R \to 0} \left(\frac{1}{|B_R^+|} \int_{B_R^+} \tilde{x}_{i_1}\,dw - \frac{1}{|B_R^-|} \int_{B_R^-} 0\,dw\right) = \tilde{x}_{i_1},$$

$$\lim_{R \to 0} \left(\frac{1}{|B_R^+|} \int_{B_R^+} \phi_{i_2}(w)\,dw - \frac{1}{|B_R^-|} \int_{B_R^-} \phi_{i_2}(w)\,dw\right) = \lim_{R \to 0} \left(\frac{1}{|B_R^+|} \int_{B_R^+} 0\,dw - \frac{1}{|B_R^-|} \int_{B_R^-} \tilde{x}_{i_2}\,dw\right) = -\tilde{x}_{i_2}.$$

By applying these limits to $\sum_{j=1}^{n} \alpha_j \phi_j(w) = 0$, we find that $\alpha_{i_1} \tilde{x}_{i_1} - \alpha_{i_2} \tilde{x}_{i_2} = 0$. Since the last entries of both $\tilde{x}_{i_1}$ and $\tilde{x}_{i_2}$ are $1/\sqrt{2}$, we have $\alpha_{i_1} = \alpha_{i_2}$; since the first $d$ entries give $(\alpha_{i_1} + \alpha_{i_2}) x_{i_1} = 0$ and $x_{i_1} \ne 0$, we also have $\alpha_{i_1} = -\alpha_{i_2}$. Thus, $\alpha_{i_1} = 0$. Since $i_1$ is arbitrary, we have shown that $\alpha_j = 0$ for every $1 \le j \le n$, and the statement of the proposition follows.

The proof of Proposition 1 remains valid if we assume instead that each entry of $w_r$ is initialized from an iid sub-Gaussian distribution with zero mean whose support is the entire $\mathbb{R}$, provided we update the definition of $H^\infty$ according to eq. (9).
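The Gaussian expectation behind $H^\infty$ has a well-known closed form: for unit vectors, $\mathbb{E}\,\mathbf{1}_{\{w^\top x_i \ge 0,\, w^\top x_j \ge 0\}} = (\pi - \arccos(x_i^\top x_j))/(2\pi)$. A small numerical sketch (with hypothetical sample points, including an antipodal pair as allowed by Proposition 1) checks symmetry, positive definiteness, and the positivity of the eigenvalues of $H^\infty P$ via the similarity argument of Appendix A.

```python
import numpy as np

# Closed form: H^infty_ij = ((x_i.x_j + 1)/2) * (pi - arccos(x_i.x_j)) / (2*pi),
# where (x_i.x_j + 1)/2 is the lifted inner product of the tilde-vectors.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = np.vstack([X, -X[0]])                 # include an antipodal pair x_{i2} = -x_{i1}

G = np.clip(X @ X.T, -1.0, 1.0)
H = ((G + 1.0) / 2.0) * (np.pi - np.arccos(G)) / (2.0 * np.pi)
min_eig = np.linalg.eigvalsh(H).min()     # Proposition 1 predicts min_eig > 0

# With a positive diagonal P (e.g. quadrature weights), H P should have
# real, positive eigenvalues, by similarity to P^{1/2} H P^{1/2}.
P = np.diag(rng.uniform(0.5, 2.0, len(H)))
eigs_HP = np.linalg.eigvals(H @ P)
```

The antipodal pair makes the no-parallel-points assumption of (Du et al., 2018) fail, yet the kernel matrix stays positive definite, which is exactly the point of the modified proof above.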

C THE CONVERGENCE OF NEURAL NETWORK TRAINING

In this section, we develop the theory for learning an NN with a general loss function $\Phi_P$ defined by a positive definite matrix $P$ in eq. (6). In particular, we prove Theorem 1, which states that provided the learning rate is sufficiently small, the weights are initialized without too much variance, and the NN is sufficiently wide, the residual in the first few epochs can be described with the matrix $H^\infty P$. While our proof is similar to that of (Su & Yang, 2019), the argument is distinct in three essential ways: (1) Our proof applies to any loss function defined by a positive definite matrix $P$, which requires us to use a different Hilbert space $(\mathbb{R}^n, \langle \cdot, \cdot\rangle_P)$. (2) While the result in (Su & Yang, 2019) bounds the residual using the minimum eigenvalue of $H^\infty$, we estimate the residual as a matrix-vector product $(I - 2\eta H^\infty P)^k y$, which allows us to analyze the training error using all eigenvalues of $H^\infty P$. (3) We use a different NN function that incorporates the bias terms, and we do not assume that the weights are initialized in a way that makes $N_0 = 0$.

Before we prove the theorem, we define some useful quantities. Let $A$ be the set of indices $r$ such that the coefficients $a_r$ are initialized to $1$ and let $B$ be the set initialized to $-1$. We then decompose $H(k)$, defined in eq. (25), into two parts, $H(k) = H^+(k) + H^-(k)$, with

$$H^+_{ij}(k) = \frac{1}{m}\,\tilde{x}_i^\top \tilde{x}_j \sum_{r \in A} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_i \ge 0\}} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_j \ge 0\}}, \qquad H^-_{ij}(k) = \frac{1}{m}\,\tilde{x}_i^\top \tilde{x}_j \sum_{r \in B} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_i \ge 0\}} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_j \ge 0\}}.$$

Similarly, we define two other matrices $\widetilde{H}^+(k)$ and $\widetilde{H}^-(k)$ as

$$\widetilde{H}^+_{ij}(k) = \frac{1}{m}\,\tilde{x}_i^\top \tilde{x}_j \sum_{r \in A} \mathbf{1}_{\{\tilde{w}_r(k+1)^\top \tilde{x}_i \ge 0\}} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_j \ge 0\}}, \qquad \widetilde{H}^-_{ij}(k) = \frac{1}{m}\,\tilde{x}_i^\top \tilde{x}_j \sum_{r \in B} \mathbf{1}_{\{\tilde{w}_r(k+1)^\top \tilde{x}_i \ge 0\}} \mathbf{1}_{\{\tilde{w}_r(k)^\top \tilde{x}_j \ge 0\}}.$$

Unfortunately, $\widetilde{H}^+(k)$ and $\widetilde{H}^-(k)$ are not necessarily symmetric, and they differ from $H^+(k)$ and $H^-(k)$ only through sign flips of the weights. To simplify the notation later, we also define two auxiliary matrices $L(k)$ and $M(k)$ as

$$L(k) = \widetilde{H}^+(k) - H^+(k), \qquad M(k) = \widetilde{H}^-(k) - H^-(k).$$
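The proof of Lemma 1 below rests on an elementary one-sided linearization of ReLU. A quick brute-force check of that inequality over a grid of pairs:

```python
import numpy as np

# Check the ReLU sandwich used in the proof of Lemma 1:
# (b - a) 1{a > 0} <= ReLU(b) - ReLU(a) <= (b - a) 1{b > 0}  for all a, b.
relu = lambda t: np.maximum(t, 0.0)
a, b = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
diff = relu(b) - relu(a)
lower = (b - a) * (a > 0)
upper = (b - a) * (b > 0)
ok = bool(np.all(lower <= diff + 1e-12) and np.all(diff <= upper + 1e-12))
```

Note that the inequality holds for every sign pattern of $a$, $b$, and $b - a$, which is why the proof can apply it term by term without tracking the sign of $(Pz(k))_p$.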
We now prove that $I - 2\eta H(k)P$ is close to the transition matrix for the residual, up to sign flips, i.e., $y - u(k+1) \approx (I - 2\eta H(k)P)(y - u(k))$.

Lemma 1. Let $z(k) = y - u(k)$ be the residual after the $k$th iteration. For any $k \ge 0$ and $\eta > 0$, we have

$$\left(I - 2\eta\left(\widetilde{H}^+(k) + H^-(k)\right)P\right) z(k) \le z(k+1) \le \left(I - 2\eta\left(H^+(k) + \widetilde{H}^-(k)\right)P\right) z(k),$$

where the inequalities are entry-wise.

Proof. First, by the gradient descent update rule, we have

$$\tilde{w}_r(k+1) - \tilde{w}_r(k) = -\eta\, \frac{\partial \Phi_P(\tilde{w}_1(k), \ldots, \tilde{w}_m(k))}{\partial \tilde{w}_r} = -\eta\, \frac{\partial u(k)}{\partial \tilde{w}_r}\, \frac{\partial \Phi_P(u)}{\partial u}, \qquad (27)$$

where the $(d+1) \times n$ Jacobian matrix is given by

$$\frac{\partial u(k)}{\partial \tilde{w}_r} = \frac{\sqrt{2}\, a_r}{\sqrt{m}} \left[\tilde{x}_1 \mathbf{1}_{\{\tilde{x}_1^\top \tilde{w}_r(k) \ge 0\}}\ \cdots\ \tilde{x}_n \mathbf{1}_{\{\tilde{x}_n^\top \tilde{w}_r(k) \ge 0\}}\right],$$

and the gradient of the loss function $\Phi_P$ defined in eq. (6) with respect to $u$ is the length-$n$ vector

$$\frac{\partial \Phi_P(u)}{\partial u} = -P(y - u(k)) = -P z(k). \qquad (28)$$

Hence, it follows that

$$\tilde{w}_r(k+1)^\top \tilde{x}_i - \tilde{w}_r(k)^\top \tilde{x}_i = \frac{\sqrt{2}\,\eta\, a_r}{\sqrt{m}} \sum_{p=1}^{n} (P z(k))_p\, \tilde{x}_i^\top \tilde{x}_p\, \mathbf{1}_{\{\tilde{x}_p^\top \tilde{w}_r(k) \ge 0\}},$$

where $(P z(k))_p$ denotes the $p$th entry of $P z(k)$. Using the property of ReLU that

$$(b - a)\mathbf{1}_{\{a > 0\}} \le \mathrm{ReLU}(b) - \mathrm{ReLU}(a) \le (b - a)\mathbf{1}_{\{b > 0\}}, \qquad a, b \in \mathbb{R},$$

we have

$$\mathrm{ReLU}\left(\tilde{w}_r(k+1)^\top \tilde{x}_i\right) - \mathrm{ReLU}\left(\tilde{w}_r(k)^\top \tilde{x}_i\right) \le \frac{\sqrt{2}\,\eta\, a_r}{\sqrt{m}} \sum_{p=1}^{n} (P z(k))_p\, \tilde{x}_i^\top \tilde{x}_p\, \mathbf{1}_{\{\tilde{x}_p^\top \tilde{w}_r(k) \ge 0\}} \mathbf{1}_{\{\tilde{x}_i^\top \tilde{w}_r(k+1) \ge 0\}},$$

$$\mathrm{ReLU}\left(\tilde{w}_r(k+1)^\top \tilde{x}_i\right) - \mathrm{ReLU}\left(\tilde{w}_r(k)^\top \tilde{x}_i\right) \ge \frac{\sqrt{2}\,\eta\, a_r}{\sqrt{m}} \sum_{p=1}^{n} (P z(k))_p\, \tilde{x}_i^\top \tilde{x}_p\, \mathbf{1}_{\{\tilde{x}_p^\top \tilde{w}_r(k) \ge 0\}} \mathbf{1}_{\{\tilde{x}_i^\top \tilde{w}_r(k) \ge 0\}}.$$

Hence, we have

$$(u(k+1))_i - (u(k))_i = \frac{\sqrt{2}}{\sqrt{m}} \sum_{r \in A} \left[\mathrm{ReLU}(\tilde{w}_r(k+1)^\top \tilde{x}_i) - \mathrm{ReLU}(\tilde{w}_r(k)^\top \tilde{x}_i)\right] - \frac{\sqrt{2}}{\sqrt{m}} \sum_{r \in B} \left[\mathrm{ReLU}(\tilde{w}_r(k+1)^\top \tilde{x}_i) - \mathrm{ReLU}(\tilde{w}_r(k)^\top \tilde{x}_i)\right]$$
$$\le \frac{2\eta}{m} \sum_{r \in A} \sum_{p=1}^{n} (P z(k))_p\, \tilde{x}_i^\top \tilde{x}_p\, \mathbf{1}_{\{\tilde{x}_p^\top \tilde{w}_r(k) \ge 0\}} \mathbf{1}_{\{\tilde{x}_i^\top \tilde{w}_r(k+1) \ge 0\}} + \frac{2\eta}{m} \sum_{r \in B} \sum_{p=1}^{n} (P z(k))_p\, \tilde{x}_i^\top \tilde{x}_p\, \mathbf{1}_{\{\tilde{x}_p^\top \tilde{w}_r(k) \ge 0\}} \mathbf{1}_{\{\tilde{x}_i^\top \tilde{w}_r(k) \ge 0\}}$$
$$= 2\eta \sum_{p=1}^{n} \left(\widetilde{H}^+_{ip}(k) + H^-_{ip}(k)\right)(P z(k))_p.$$

This proves the first inequality. The second inequality can be shown with a similar argument. In particular, if there is no sign flip of the weights, then $\widetilde{H}^+(k) + \widetilde{H}^-(k) = H(k)$ and the inequalities in Lemma 1 are equalities. Next, using Lemma 1, we can derive an expression for $y - u(k)$ using $H^\infty$, up to an error term.

Lemma 2.
For any $0 < \eta < 1/(2M_P^2 n)$ and any $k \ge 0$, we have that

$$y - u(k) = (I - 2\eta H^\infty P)^k (y - u(0)) + \epsilon(k), \qquad (29)$$

where

$$\|\epsilon(k)\|_P \le 2\eta \sum_{t=0}^{k-1} \|(H^\infty - H(t))P\|_P \left\|(I - 2\eta H^\infty P)^t (y - u(0))\right\|_P + 2\eta \sum_{t=0}^{k-1} \left(\|M(t)P\|_P + \|L(t)P\|_P\right) \|y - u(t)\|_P. \qquad (30)$$

Proof. For $k \ge 1$, we define $r(k)$ by

$$r(k) = y - u(k) - (I - 2\eta H(k-1)P)(y - u(k-1)). \qquad (31)$$

Then, by Lemma 1 we have

$$\|r(k)\|_P \le 2\eta \left(\|M(k-1)P\|_P + \|L(k-1)P\|_P\right) \|y - u(k-1)\|_P. \qquad (32)$$

Note that eq. (31) is a first-order non-homogeneous recurrence relation for $y - u(k)$, which has an analytic solution. Thus, we can expand $y - u(k)$ for $k \ge 1$ as

$$y - u(k) = \left((I - 2\eta H(k-1)P) \cdots (I - 2\eta H(0)P)\right)(y - u(0)) + r(k) + \sum_{t=1}^{k-1} \left((I - 2\eta H(k-1)P) \cdots (I - 2\eta H(t)P)\right) r(t). \qquad (33)$$

Moreover, we can write the product of the matrices as (Su & Yang, 2019)

$$(I - 2\eta H(k-1)P) \cdots (I - 2\eta H(0)P) = (I - 2\eta H^\infty P)^k + 2\eta\, (H^\infty P - H(k-1)P)(I - 2\eta H^\infty P)^{k-1}$$
$$+ 2\eta \sum_{t=1}^{k-1} \left((I - 2\eta H(k-1)P) \cdots (I - 2\eta H(t)P)\right)(H^\infty P - H(t-1)P)(I - 2\eta H^\infty P)^{t-1}. \qquad (34)$$

Combining eq. (33) and eq. (34), we obtain eq. (29) with

$$\epsilon(k) = 2\eta\, (H^\infty P - H(k-1)P)(I - 2\eta H^\infty P)^{k-1}(y - u(0)) + 2\eta \sum_{t=1}^{k-1} (I - 2\eta H(k-1)P) \cdots (I - 2\eta H(t)P)\, (H^\infty - H(t-1))P\, (I - 2\eta H^\infty P)^{t-1}(y - u(0))$$
$$+ r(k) + \sum_{t=1}^{k-1} (I - 2\eta H(k-1)P) \cdots (I - 2\eta H(t)P)\, r(t).$$

Finally, we note that

$$\lambda_{\max}(H(t)P) = \lambda_{\max}(P^{1/2} H(t) P^{1/2}) = \left\|P^{1/2} H(t) P^{1/2}\right\|_2 = \sup_{\xi \in \mathbb{R}^n \setminus \{0\}} \frac{\left\|P^{1/2} H(t) P^{1/2} \xi\right\|_2}{\|\xi\|_2} = \sup_{\zeta \in \mathbb{R}^n \setminus \{0\}} \frac{\left\|P^{1/2} H(t) P \zeta\right\|_2}{\left\|P^{1/2} \zeta\right\|_2} = \sup_{\zeta \in \mathbb{R}^n \setminus \{0\}} \frac{\|H(t) P \zeta\|_P}{\|\zeta\|_P} = \|H(t)P\|_P. \qquad (35)$$

We can then bound $\|H(t)P\|_P$ using $M_P$ defined in eq. (8) and $\|H(t)\|_2$ as

$$\|H(t)P\|_P \le M_P \|H(t)\|_2 M_P \le M_P^2 \sqrt{\|H(t)\|_1 \|H(t)\|_\infty} \le M_P^2\, n. \qquad (36)$$

By requiring that $\eta < 1/(2M_P^2 n)$, we have $\lambda_{\min}(I - 2\eta H(t)P) > 0$ for all $t$. Hence, $I - 2\eta H(t)P$ is positive definite in $(\mathbb{R}^n, \langle \cdot, \cdot\rangle_P)$, and according to eq. (35), we have $\|I - 2\eta H(t)P\|_P = \lambda_{\max}(I - 2\eta H(t)P) \le 1$. The upper bound in eq. (30) follows from the triangle inequality and our estimate on $r(k)$ in eq. (32).

The residual terms in Lemma 2 can be made small by controlling $\|M(t)P\|_P + \|L(t)P\|_P$ and $\|(H(0) - H(t))P\|_P$. Their upper bounds are given in Lemma 3.
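At leading order, Lemma 2 says the residual follows the linear iteration $z(k+1) = (I - 2\eta H^\infty P)\, z(k)$, with decay rate governed by the smallest eigenvalue $\lambda_0$ of $H^\infty P$. A small numerical sketch with synthetic SPD stand-ins for $H^\infty$ and $P$ (both hypothetical) confirms the $(1 - 2\eta\lambda_0)^k$ decay in the $P$-norm:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = rng.standard_normal((n, n))
H = A @ A.T / n + np.eye(n)            # synthetic SPD stand-in for H^infty
P = np.diag(rng.uniform(0.5, 1.5, n))  # e.g. a diagonal of quadrature weights

# Eigenvalues of H P equal those of the SPD matrix P^{1/2} H P^{1/2}.
Phalf = np.diag(np.sqrt(np.diag(P)))
lam = np.linalg.eigvalsh(Phalf @ H @ Phalf)
lam0, lam_max = lam.min(), lam.max()
eta = 1.0 / (4.0 * lam_max)            # small enough that 2*eta*lam_max < 1

pnorm = lambda v: float(np.sqrt(v @ P @ v))
z = rng.standard_normal(n)
z0_norm = pnorm(z)
for _ in range(50):
    z = z - 2.0 * eta * H @ (P @ z)    # z(k+1) = (I - 2 eta H P) z(k)
final_ratio = pnorm(z) / z0_norm
bound = (1.0 - 2.0 * eta * lam0) ** 50  # worst-case decay from lambda_0
```

Since all eigenvalues of $I - 2\eta H P$ lie in $(0, 1 - 2\eta\lambda_0]$ for this step size, the $P$-norm of the residual contracts at least geometrically, which is the mechanism behind the frequency-dependent rates in Section D.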
First, we define

$$S_i(t) = \left\{1 \le r \le m \;\middle|\; \mathbf{1}_{\{\tilde{w}_r(t')^\top \tilde{x}_i \ge 0\}} \ne \mathbf{1}_{\{\tilde{w}_r(0)^\top \tilde{x}_i \ge 0\}} \text{ for some } 0 \le t' \le t\right\}$$

to be the set of indices of the weights that have changed sign at least once by the $t$th iteration.

Lemma 3. For all $t \ge 0$, we have

$$\max\left(\|M(t)P\|_P + \|L(t)P\|_P,\ \|(H(0) - H(t))P\|_P\right) \le \sqrt{\frac{4 M_P^4\, n}{m^2} \sum_{i=1}^{n} |S_i(t)|^2}.$$

Moreover, for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have

$$\|(H^\infty - H(0))P\|_P \le 2 M_P^2 \sqrt{\frac{n^2 \log(2n/\delta)}{m}}.$$

Proof. First, we have

$$\|M(t)P\|_P^2 \le M_P^4 \|M(t)\|_2^2 \le M_P^4 \|M(t)\|_F^2 \le \frac{M_P^4}{m^2} \sum_{i=1}^{n} \sum_{p=1}^{n} \left(\sum_{r \in A} \left|\mathbf{1}_{\{\tilde{w}_r(t)^\top \tilde{x}_i \ge 0,\ \tilde{w}_r(t)^\top \tilde{x}_p \ge 0\}} - \mathbf{1}_{\{\tilde{w}_r(t+1)^\top \tilde{x}_i \ge 0,\ \tilde{w}_r(t)^\top \tilde{x}_p \ge 0\}}\right|\right)^2 \le \frac{M_P^4\, n}{m^2} \sum_{i=1}^{n} |S_i(t)|^2.$$

The estimate for $\|L(t)P\|_P^2$ is exactly the same and is obtained by replacing $A$ with $B$. We also have

$$\|(H(0) - H(t))P\|_P^2 \le M_P^4 \|H(0) - H(t)\|_2^2 \le M_P^4 \|H(0) - H(t)\|_F^2 \le \frac{M_P^4}{m^2} \sum_{i=1}^{n} \sum_{p=1}^{n} \left(\sum_{r=1}^{m} \left|\mathbf{1}_{\{\tilde{w}_r(0)^\top \tilde{x}_i \ge 0,\ \tilde{w}_r(0)^\top \tilde{x}_p \ge 0\}} - \mathbf{1}_{\{\tilde{w}_r(t)^\top \tilde{x}_i \ge 0,\ \tilde{w}_r(t)^\top \tilde{x}_p \ge 0\}}\right|\right)^2$$
$$\le \frac{M_P^4}{m^2} \sum_{i=1}^{n} \sum_{p=1}^{n} \left(|S_i(t)| + |S_p(t)|\right)^2 \le \frac{M_P^4}{m^2} \sum_{i=1}^{n} \sum_{p=1}^{n} \left(2|S_i(t)|^2 + 2|S_p(t)|^2\right) = \frac{4 M_P^4\, n}{m^2} \sum_{i=1}^{n} |S_i(t)|^2.$$

This proves the first inequality. Since $m H_{ij}(0)$ is the sum of $m$ iid random variables bounded in $[0, 1]$, by Hoeffding's inequality (Hoeffding, 1963), for any $t > 0$ and any $1 \le i, j \le n$, we have

$$\mathbb{P}\left(m\left|H_{ij}(0) - H^\infty_{ij}\right| \ge t\right) \le 2\exp\left(-\frac{2t^2}{m}\right) \le 2\exp\left(-\frac{t^2}{m}\right).$$

Set $t = \sqrt{m \log(2n^2/\delta)}$. With probability at least $1 - \delta/n^2$, we have

$$\left|H_{ij}(0) - H^\infty_{ij}\right| \le \sqrt{\frac{\log(2n^2/\delta)}{m}} \le \sqrt{\frac{2\log(2n/\delta)}{m}}.$$

Hence, by a union bound, we know that with probability at least $1 - \delta$, we have

$$\|H^\infty - H(0)\|_2 \le \|H^\infty - H(0)\|_F \le \sqrt{\frac{2 n^2 \log(2n/\delta)}{m}}.$$

The last estimate follows from the definition of $M_P$ (see eq. (8)) and eq. (36).

Now, we state and prove our initial control of the decay of the residual.

Lemma 4. Let $\epsilon > 0$, $\kappa > 0$, $0 < \delta < 1$, and $T > 0$ be given.
There exist constants $C_m, C'_m > 0$ such that if $0 \le \eta \le 1/(2M_P^2 n)$ and $m$ satisfies

$$m \ge C_m\, \frac{M_P^6\, n^3\, \kappa^2}{\epsilon^2}\, \lambda_0^{-4} \left(1 + \frac{\kappa^2 M_P^2\, n}{\delta^2} + \eta^4 T^4 \epsilon^4\right) \quad \text{and} \quad m \ge C'_m\, \frac{M_P^4\, n^2 \log(n/\delta)}{\epsilon^2}\, \lambda_0^{-2} \left(1 + \frac{\kappa^2 M_P^2\, n}{\delta} + \eta^2 T^2 \epsilon^2\right),$$

then with probability at least $1 - \delta$, we have, for all $0 \le k \le T$,

$$y - u(k) = (I - 2\eta H^\infty P)^k (y - u(0)) + \epsilon(k), \qquad \|\epsilon(k)\|_P \le \epsilon. \qquad (39)$$

Proof. Set $\delta' = \delta/3$. For any $R > 0$ and $r = 1, \ldots, m$, since $w_r(0)^\top x_i \sim \mathcal{N}(0, \kappa^2)$, we have

$$\mathbb{P}\left(\left|\tilde{w}_r(0)^\top \tilde{x}_i\right| \le R\right) = \mathbb{E}\,\mathbf{1}_{\{|\tilde{w}_r(0)^\top \tilde{x}_i| \le R\}} < \frac{2R}{\sqrt{\pi}\,\kappa}.$$

By Hoeffding's inequality (Hoeffding, 1963), for any $t > 0$ we have

$$\mathbb{P}\left(\sum_{r=1}^{m} \mathbf{1}_{\{|\tilde{w}_r(0)^\top \tilde{x}_i| \le R\}} \ge \frac{2mR}{\sqrt{\pi}\,\kappa} + t\right) \le \exp\left(-\frac{2t^2}{m}\right) \le \exp\left(-\frac{t^2}{m}\right), \qquad 1 \le i \le n. \qquad (40)$$

Thus, if we set $t = \sqrt{m \log(n/\delta')}$, then with probability at least $1 - \delta'/n$ we have

$$\sum_{r=1}^{m} \mathbf{1}_{\{|\tilde{w}_r(0)^\top \tilde{x}_i| \le R\}} \le \frac{2mR}{\sqrt{\pi}\,\kappa} + \sqrt{m \log(n/\delta')} \le 2m\left(\frac{R}{\sqrt{\pi}\,\kappa} + \sqrt{\frac{\log(n/\delta')}{m}}\right).$$

Published as a conference paper at ICLR 2023

By a union bound, with probability at least $1 - \delta'$,

$$\sum_{i=1}^{n} \left(\sum_{r=1}^{m} \mathbf{1}_{\{|\tilde{w}_r(0)^\top \tilde{x}_i| \le R\}}\right)^2 \le 4m^2 n \left(\frac{R}{\sqrt{\pi}\,\kappa} + \sqrt{\frac{\log(n/\delta')}{m}}\right)^2.$$

By combining this with Lemma 3, we have that with probability at least $1 - 2\delta'$,

$$\sqrt{\frac{4 M_P^4\, n}{m^2} \sum_{i=1}^{n} \left(\sum_{r=1}^{m} \mathbf{1}_{\{|\tilde{w}_r(0)^\top \tilde{x}_i| \le R\}}\right)^2} \le 4 M_P^2\, n \left(\frac{R}{\sqrt{\pi}\,\kappa} + \sqrt{\frac{\log(n/\delta')}{m}}\right), \qquad (41)$$

and

$$\|(H^\infty - H(0))P\|_P \le 2 M_P^2 \sqrt{\frac{n^2 \log(2n/\delta')}{m}}. \qquad (42)$$

Since the $i$th entry of $u(0)$ has mean $0$ and variance at most $\kappa^2$, we have $\mathbb{E}[(u(0))_i^2] \le \kappa^2$. Hence, $\mathbb{E}\,\|u(0)\|_P^2 \le M_P^2\, n\, \kappa^2$. By Markov's inequality, with probability at least $1 - \delta'$, we have

$$\|u(0)\|_P \le \kappa M_P \sqrt{n/\delta'}, \qquad \|y - u(0)\|_P \le \|y\|_P + \kappa M_P \sqrt{n/\delta'}. \qquad (43)$$

By a union bound, eqs. (41) to (43) hold simultaneously with probability at least $1 - 3\delta'$. The theorem now follows by induction, where the base case $k = 0$ is obvious. Assume eq. (39) holds for $t = 0, \ldots, k-1$, where $1 \le k \le T$.
Then, we have

$$2\eta \sum_{t=0}^{k-1} \|y - u(t)\|_P \le 2\eta \sum_{t=0}^{k-1} \left[(1 - 2\eta\lambda_0)^t \|y - u(0)\|_P + \epsilon\right] \le \lambda_0^{-1} \|y - u(0)\|_P + 2\eta T \epsilon, \qquad (44)$$

where the first inequality follows from the fact that $I - 2\eta H^\infty P$ is positive semidefinite in $(\mathbb{R}^n, \langle \cdot, \cdot\rangle_P)$ with maximum eigenvalue $1 - 2\eta\lambda_0$, and the second inequality follows by bounding the geometric series. By the definition of $\lambda_0$, we have

$$2\eta \sum_{t=0}^{k-1} \left\|(I - 2\eta H^\infty P)^t (y - u(0))\right\|_P \le 2\eta \sum_{t=0}^{k-1} (1 - 2\eta\lambda_0)^t \|y - u(0)\|_P \le \lambda_0^{-1} \|y - u(0)\|_P. \qquad (45)$$

By Lemmas 2 and 3, we have $y - u(k) = (I - 2\eta H^\infty P)^k (y - u(0)) + \epsilon(k)$ with

$$\|\epsilon(k)\|_P \le \lambda_0^{-1} \|(H(0) - H^\infty)P\|_P \|y - u(0)\|_P + 2\sqrt{\frac{4 M_P^4\, n}{m^2} \sum_{i=1}^{n} |S_i(k)|^2}\, \left(\lambda_0^{-1} \|y - u(0)\|_P + \eta T \epsilon\right), \qquad (46)$$

where we used the fact that $|S_i(t)|$ is a nondecreasing function of $t$ and the triangle inequality $\|(H^\infty - H(t))P\|_P \le \|(H(0) - H^\infty)P\|_P + \|(H(t) - H(0))P\|_P$. Here, we also combined $\lambda_0^{-1}\|(H(t) - H(0))P\|_P \|y - u(0)\|_P$ with the last term on the right-hand side of eq. (30), and applied Lemma 3, eq. (44), and eq. (45) to obtain the last term on the right-hand side of eq. (46).

To control $|S_i(k)|$, we first bound the change of the weights. For any $0 \le t \le k-1$ and $1 \le r \le m$, the change of the weights in one iteration can be bounded by

$$\|\tilde{w}_r(t+1) - \tilde{w}_r(t)\|_2 = \eta \left\|\frac{\partial u(t)}{\partial \tilde{w}_r}\, \frac{\partial \Phi_P(u)}{\partial u}\right\|_2 \le \eta \left\|\frac{\partial u(t)}{\partial \tilde{w}_r}\right\|_F \|P(y - u(t))\|_2 \le \eta \sqrt{\frac{2n}{m}}\, M_P\, \|y - u(t)\|_P, \qquad (47)$$

where the inequalities follow from eq. (27) and eq. (28). Hence, the total change of the weights can be bounded by

$$\|\tilde{w}_r(t) - \tilde{w}_r(0)\|_2 \le \eta M_P \sqrt{\frac{2n}{m}} \sum_{t'=0}^{t-1} \|y - u(t')\|_P \le R_T, \qquad R_T = M_P \sqrt{\frac{n}{2m}} \left(\lambda_0^{-1} \|y - u(0)\|_P + 2\eta T \epsilon\right). \qquad (48)$$

Recall that $S_i(k)$ is the set of indices of weights that have gone through at least one sign flip by iteration $k$. Thus, if $r \in S_i(k)$, we have $\|\tilde{w}_r(t) - \tilde{w}_r(0)\|_2 \ge |\tilde{w}_r(0)^\top \tilde{x}_i|$ for some $0 \le t \le k$, as a sign flip implies $|\tilde{w}_r(0)^\top \tilde{x}_i| \le |\tilde{w}_r(0)^\top \tilde{x}_i - \tilde{w}_r(t)^\top \tilde{x}_i|$. This gives us

$$|S_i(k)| \le \left|\left\{r \in [m] : \left|\tilde{w}_r(0)^\top \tilde{x}_i\right| \le \|\tilde{w}_r(t) - \tilde{w}_r(0)\|_2 \text{ for some } 0 \le t \le k\right\}\right| \le \left|\left\{r \in [m] : \left|\tilde{w}_r(0)^\top \tilde{x}_i\right| \le R_T\right\}\right|,$$

where $[m] = \{1, \ldots, m\}$.
Hence, there exists a constant $C > 0$ such that

$$\|\epsilon(k)\|_P \le 2 M_P^2 \sqrt{\frac{n^2 \log(2n/\delta')}{m}}\, \lambda_0^{-1} \|y - u(0)\|_P + 8 M_P^2\, n \left(\frac{R_T}{\sqrt{\pi}\,\kappa} + \sqrt{\frac{\log(n/\delta')}{m}}\right)\left(\lambda_0^{-1} \|y - u(0)\|_P + \eta T \epsilon\right)$$

$$\le 2 M_P^2\, n\, \underbrace{\sqrt{\frac{\log(6n/\delta)}{m}}\, \lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right)}_{A_1} + 8 M_P^2\, n \left(\underbrace{\frac{1}{\sqrt{\pi}\,\kappa}\, M_P \sqrt{\frac{n}{2m}}\left[\lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right) + 2\eta T \epsilon\right]}_{A_2} + \underbrace{\sqrt{\frac{\log(3n/\delta)}{m}}}_{A_3}\right) \underbrace{\left[\lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right) + \eta T \epsilon\right]}_{A_4},$$

where the first inequality follows from eq. (41), eq. (46), and eq. (48), and the second follows from eq. (43) and eq. (47). Finally, eq. (39) follows from the way we chose $m$: by taking $C_m$ large enough, we guarantee that $2M_P^2 n A_1,\ 8M_P^2 n A_2 A_4 < \epsilon/3$, and by taking $C'_m$ large enough, we guarantee that $8M_P^2 n A_3 A_4 < \epsilon/3$. Hence, eq. (39) follows.

Lemma 4 gives us an estimate of the residual $y - u(k)$ in terms of the initial residual $y - u(0)$. However, in analyzing the frequency bias, we hope to express the residual in terms of $y$ only. This can be done by controlling the size of $u(0)$. First, we note that the proof of eq. (43) does not rely on the assumptions on $m$ in Lemma 4; hence, it holds for any $n, m \ge 1$, $\kappa > 0$, $0 < \delta < 1$, and positive definite matrix $P$. We are now ready to prove our first main result, Theorem 1.

Proof of Theorem 1. By eq. (43), with probability at least $1 - \delta/2$, we have

$$\left\|(I - 2\eta H^\infty P)^k u(0)\right\|_P \le \|u(0)\|_P \le \kappa M_P \sqrt{2n/\delta}. \qquad (49)$$

By taking $\kappa \le \epsilon \sqrt{\delta/(2n)}/(2M_P)$, we guarantee that $\|(I - 2\eta H^\infty P)^k u(0)\|_P \le \epsilon/2$. By the way we pick $\kappa$ and $m$, for some constant $C' > 0$ that depends only on $d$, we have

$$1 + \frac{\kappa^2 M_P^2\, n}{\delta} \le C'.$$

Hence, by taking $C_2$ in eq. (13) large enough, we guarantee that $m$ satisfies the assumptions in Lemma 4 with $\epsilon$, $\kappa$, $T$, and $\delta$ replaced by $\epsilon/2$, $\kappa$, $T$, and $\delta/2$, respectively. Then eq. (39) holds with probability at least $1 - \delta/2$, with $\|\epsilon(k)\|_P \le \epsilon/2$. The result follows from the triangle inequality and a union bound.

Notably, there are other initialization schemes that allow us to avoid using $\kappa$.
One example is to initialize the weights at odd indices $w_{2p+1}$ and $a_{2p+1}$ randomly and set $w_{2p+2} = w_{2p+1}$, $a_{2p+2} = -a_{2p+1}$, assuming $m$ is even (Su & Yang, 2019). This initialization scheme guarantees that $u(0) = 0$, and hence we do not need to introduce $\kappa$ to control the initialization size $\|u(0)\|_P$. In addition, if we assume that each entry of $w_r$ is initialized from an iid sub-Gaussian distribution with zero mean whose support is the entire $\mathbb{R}$, then, since $H^\infty$ is still symmetric positive definite (see the remark at the end of Appendix B) and Hoeffding's inequality still holds, the proof does not break down and a result similar to Theorem 1 can be shown.

Following the same steps of the proof, we can study the case when the gradient descent steps are slightly perturbed. Suppose we perturb the output of the NN by $\delta u_j$ at the $j$th iteration. Then, we expect that the residual of the NN at the $k$th iteration is approximately

$$y - u(k) \approx (I - 2\eta H^\infty P)\left(\cdots\left((I - 2\eta H^\infty P)\left((I - 2\eta H^\infty P)\, y + \delta u_1\right) + \delta u_2\right) + \cdots\right) + \delta u_k = (I - 2\eta H^\infty P)^k y + \sum_{j=1}^{k} (I - 2\eta H^\infty P)^{k-j} \delta u_j.$$

Since the maximum eigenvalue of $I - 2\eta H^\infty P$ is less than one, we can control the errors in this approximation using arguments similar to the previous lemmas.

In Theorem 1, we showed how the parameters of the NN should depend on the desired maximum error $\epsilon$. Sometimes, it is also very useful to understand the dependence of $\epsilon$ on the parameters $\eta$, $T$, $n$, $m$, etc. Therefore, we present the following result to show this dependence.

Theorem 1′. In eq. (5), suppose that $w_1, \ldots, w_m$ are initialized iid from Gaussian random variables with covariance matrix $\kappa^2 I$, that $b_1, \ldots, b_m$ are initialized to zero, and that $a_1, \ldots, a_m$ are initialized iid as $+1$ with probability $1/2$ and $-1$ otherwise. Suppose the NN is trained with training data $(x_i, y_i)$ for $1 \le i \le n$ and the loss function $\Phi_P$ in eq. (6) for a symmetric positive definite matrix $P$, and that the training procedure is the gradient descent update rule in eq.
(7) with step size $\eta \le 1/(2M_P^2 n)$. Let $N_k$ be the NN function after the $k$th iteration and $u(k) = (N_k(x_1), \ldots, N_k(x_n))$, where $N_0$ is the initial NN function. Let a probability of failure $0 < \delta < 1$ be given. Then, there exists a constant $C > 0$ that depends only on the dimension $d$ such that, with probability at least $1 - \delta$, the following statement holds: for any $k \ge 1$, if we define $\epsilon_k$ by

$$\epsilon_k = \max_{1 \le t \le k-1} \left\|y - u(t) - (I - 2\eta H^\infty P)^t y\right\|_P,$$

then we have

$$\left\|y - u(k) - (I - 2\eta H^\infty P)^k y\right\|_P \le 2 M_P^2\, n \sqrt{\frac{\log(6n/\delta)}{m}}\, \lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right)$$
$$+ 8 M_P^2\, n \left(\frac{1}{\sqrt{\pi}\,\kappa}\, M_P \sqrt{\frac{n}{2m}}\left[\lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right) + 2\eta k \epsilon_k\right] + \sqrt{\frac{\log(3n/\delta)}{m}}\right)\left[\lambda_0^{-1} C\left(1 + \kappa M_P \sqrt{\frac{3n}{\delta}}\right) + \eta k \epsilon_k\right] + \kappa M_P \sqrt{\frac{2n}{\delta}}. \qquad (51)$$

Proof. The result follows immediately from eq. (49) and eq. (50).

While this paper is primarily concerned with learning a continuous function using loss functions adapted from the MSE, we briefly discuss the changes needed to study classifiers trained with cross-entropy. If the classification task has $p$ classes, then our NN has $p$ outputs, each of which is passed through a softmax layer and represents an estimate of the likelihood that the input belongs to the corresponding class. More information on the NN architecture and the cross-entropy loss function can be found in (Kurbiel, 2021). Let $N_j(x; W)$ be the $j$th entry of the outputs of the NN and let $u_j(x; W) = \mathrm{softmax}(N_j(x; W))$, $1 \le j \le p$. Let $g_j(x)$ be the ground truth of the $j$th entry of the outputs and let $z_j(x; W) = g_j(x) - u_j(x; W)$ be the residual. Assume we use the gradient flow and have access to data on the entire domain. Then, to derive a formula analogous to eq.
(2), for each fixed $x$, we have

$$\frac{dz_j(x; W)}{dt} = -\frac{du_j(x; W)}{dt} = -\sum_{i=1}^{p} \frac{du_j(x; W)}{dN_i}\, \frac{dN_i(x; W)}{dt} = -\sum_{i=1}^{p} \frac{du_j(x; W)}{dN_i}\, \frac{\partial N_i(x; W)}{\partial W}\, \frac{dW}{dt}$$
$$= \sum_{i=1}^{p} \frac{du_j(x; W)}{dN_i}\, \frac{\partial N_i(x; W)}{\partial W} \left(\frac{\partial L(W)}{\partial W}\right)^\top = \sum_{i=1}^{p} \frac{du_j(x; W)}{dN_i}\, \frac{\partial N_i(x; W)}{\partial W} \left(\sum_{i'=1}^{p} \int_{S^{d-1}} -z_{i'}(x'; W) \left(\frac{\partial N_{i'}(x'; W)}{\partial W}\right)^\top d\mu(x')\right),$$

where $L$ is the cross-entropy loss function computed as an integral over $S^{d-1}$, and in the last step we used the fact that $(d/dN_{i'}) L(x'; W) = u_{i'}(x'; W) - g_{i'}(x') = -z_{i'}(x'; W)$ (Kurbiel, 2021). Hence, eq. (2) now becomes

$$\frac{dz_j(x; W)}{dt} = -\sum_{i=1}^{p} \frac{du_j(x; W)}{dN_i} \sum_{i'=1}^{p} \int_{S^{d-1}} \underbrace{\left\langle \frac{\partial N_i}{\partial W}(x; W),\ \frac{\partial N_{i'}}{\partial W}(x'; W)\right\rangle}_{=K_{i,i'}(x, x'; W)}\, z_{i'}(x'; W)\, d\mu(x'),$$

where we also know that $(d/dN_i)\, u_j(x; W) = u_i(x; W)\left(\mathbf{1}_{\{i=j\}} - u_j(x; W)\right)$. Again, we define $H^\infty_{i,i'}$ as the discretization of $K_{i,i'}$ in expectation over the random initialization of $W$. Note that $H^\infty_{i,i}$ coincides with the $H^\infty$ that we used extensively in this paper; however, the entries of $H^\infty_{i,i'}$ might differ from those of $H^\infty$ in sign. This potentially causes difficulties in analyzing the spectrum of $H^\infty_{i,i'}$. Now, by the formula above, we expect that the residual can approximately be written as

$$z_j(k+1) - z_j(k) = \left[g_j(x) - u_j(k+1)\right] - \left[g_j(x) - u_j(k)\right] \approx -\eta \sum_{i=1}^{p} u_i(k) \bullet \left(\left[\mathbf{1}_{\{i=j\}}, \cdots, \mathbf{1}_{\{i=j\}}\right]^\top - u_j(k)\right) \bullet \left(\sum_{i'=1}^{p} H^\infty_{i,i'} P\, z_{i'}(k)\right), \qquad (52)$$

where $\bullet$ denotes the entry-wise product. Let $J(k)$ be the $np \times np$ block matrix whose $(j, i)$ block is $\mathrm{diag}\left(u_i(k) \bullet \left(\left[\mathbf{1}_{\{i=j\}}, \cdots, \mathbf{1}_{\{i=j\}}\right]^\top - u_j(k)\right)\right)$ for $i, j = 0, \ldots, p-1$, and let $H^\infty$ be the $np \times np$ block matrix such that $H^\infty_{(in+1):(i+1)n,\, (i'n+1):(i'+1)n} = H^\infty_{i,i'}$ for $i, i' = 0, \ldots, p-1$. Then, eq. (52) can be written compactly as

$$z(k+1) - z(k) \approx -\eta\, J(k)\, H^\infty (I_p \otimes P)\, z(k), \qquad (53)$$

where $\otimes$ is the Kronecker product. Frequency bias can then be analyzed by studying the dynamics of $z(k)$ based on eq. (53). However, the fact that $J$ depends on $k$ is expected to add complications to the analysis.
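The softmax derivative $(d/dN_i)\,u_j = u_i(\mathbf{1}_{\{i=j\}} - u_j)$ quoted above (equivalently $u_j(\mathbf{1}_{\{i=j\}} - u_i)$, since the Jacobian is symmetric) can be verified by finite differences; the logits below are arbitrary illustrative values.

```python
import numpy as np

# Finite-difference check of the softmax Jacobian used in the chain rule:
# d u_j / d N_i = u_i (1{i=j} - u_j), i.e. J = diag(u) - u u^T.
def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

N = np.array([0.3, -1.2, 0.7, 0.1])
u = softmax(N)
J_analytic = np.diag(u) - np.outer(u, u)

eps = 1e-6
J_fd = np.zeros((4, 4))
for i in range(4):
    dN = np.zeros(4); dN[i] = eps
    J_fd[:, i] = (softmax(N + dN) - softmax(N - dN)) / (2 * eps)
```

This is exactly the factor that produces the residual-dependent block matrix $J(k)$ in eq. (53), and its dependence on $u(k)$ is what makes the cross-entropy dynamics harder to analyze than the linear dynamics of eq. (29).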

D THE THEORY OF FREQUENCY BIAS WITH AN L 2 -BASED LOSS FUNCTION

In this section, we prove the results stated in Section 4, where we are concerned with the frequency bias behavior of NN training when using the squared L 2 norm as the loss function. We theoretically show the frequency bias phenomena in this setting, up to a quadrature error.

D.1 A CONSEQUENCE OF THEOREM 1

Given a bandlimited function $g : S^{d-1} \to \mathbb{R}$ with bandlimit $L$, we can uniquely decompose $g$ into a spherical harmonic expansion $g(x) = \sum_{\ell=0}^{L} g_\ell(x)$, where $g_\ell \in \mathcal{H}_\ell^d$. Here, $\mathcal{H}_\ell^d$ is the space of restrictions to $S^{d-1}$ of (real) homogeneous harmonic polynomials of degree $\ell$. That is,

$$g_\ell(x) = \sum_{p=1}^{N(d,\ell)} \hat{g}_{\ell,p}\, Y_{\ell,p}(x), \qquad \hat{g}_{\ell,p} = \int_{S^{d-1}} g(x)\, Y_{\ell,p}(x)\,dx,$$

where $N(d,\ell)$ is given in Section 2 and $Y_{\ell,p}$ is the spherical harmonic function of degree $\ell$ and order $p$. As a consequence of Theorem 1, we can consider the NN training error with the squared $L^2$-based loss function.

Proof of Theorem 2. By Theorem 1, for every $k = 0, \ldots, T$, we can write

$$\|u(k) - y\|_c = \left\|(I - 2\eta H^\infty D_c)^k y\right\|_c + \varepsilon_3(k), \qquad |\varepsilon_3(k)| \le \|\epsilon(k)\|_c \le \epsilon. \qquad (54)$$

To estimate the first term, we first note that the matrix $I - 2\eta H^\infty D_c$ is positive semidefinite in $(\mathbb{R}^n, \langle \cdot, \cdot\rangle_c)$ and $\|I - 2\eta H^\infty D_c\|_c \le 1 - 2\eta\lambda_0$ (see Theorem 1). Since $g(x) = \sum_{\ell=0}^{L} g_\ell(x)$, we have

$$(I - 2\eta H^\infty D_c)^k y = \sum_{\ell=0}^{L} (I - 2\eta H^\infty D_c)^k y_\ell, \qquad y = \sum_{\ell=0}^{L} y_\ell,$$

where $y_\ell = [g_\ell(x_1), \ldots, g_\ell(x_n)]^\top \in \mathbb{R}^n$. By the Funk-Hecke formula and the quadrature rule, we have

$$\left(H^\infty D_c\, y_\ell\right)_p = \sum_{i=1}^{n} c_i\, K^\infty(x_p, x_i)\, g_\ell(x_i) = \int_{S^{d-1}} K^\infty(x_p, \xi)\, g_\ell(\xi)\,d\xi + e^d_{p,\ell} = \mu_\ell\, g_\ell(x_p) + e^d_{p,\ell},$$

where $e^d_{p,\ell}$ is a quadrature error (see eq. (16)). Therefore, in vectorized form, we have $H^\infty D_c\, y_\ell = \mu_\ell\, y_\ell + e^d_\ell$ or, equivalently,

$$(I - 2\eta H^\infty D_c)\, y_\ell = (1 - 2\eta\mu_\ell)\, y_\ell - 2\eta\, e^d_\ell, \qquad e^d_\ell = (e^d_{1,\ell}, \ldots, e^d_{n,\ell})^\top.$$

By applying $I - 2\eta H^\infty D_c$ to $y_\ell$ a total of $k$ times, we find that

$$(I - 2\eta H^\infty D_c)^k y_\ell = (1 - 2\eta\mu_\ell)^k y_\ell - \underbrace{2\eta \sum_{t=0}^{k-1} (1 - 2\eta\mu_\ell)^t (I - 2\eta H^\infty D_c)^{k-t-1}\, e^d_\ell}_{\varepsilon_2^\ell}.$$

The second term, $\varepsilon_2^\ell$, can easily be bounded from above to obtain

$$\left\|\varepsilon_2^\ell\right\|_c \le 2\eta \left\|e^d_\ell\right\|_c \sum_{t=0}^{\infty} (1 - 2\eta\mu_\ell)^t = \frac{1}{\mu_\ell} \left\|e^d_\ell\right\|_c.$$

The inequality above shows that $(I - 2\eta H^\infty D_c)^k y_\ell$ is close to $(1 - 2\eta\mu_\ell)^k y_\ell$.
We define

$$\varepsilon_2 = \left\|\sum_{\ell=0}^{L} (I - 2\eta H^\infty D_c)^k y_\ell\right\|_c - \left\|\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^k y_\ell\right\|_c$$

to be the quantity in the statement of Theorem 2, which measures the accuracy of approximating the eigenvalues and eigenvectors of $H^\infty D_c$ by the eigenvalues and eigenfunctions of the continuous kernel $K^\infty$. Hence, we have

$$\left\|(I - 2\eta H^\infty D_c)^k y\right\|_c = \left\|\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^k y_\ell\right\|_c + \varepsilon_2. \qquad (55)$$

Using the triangle inequality, we have

$$|\varepsilon_2| \le \sum_{\ell=0}^{L} \left\|\varepsilon_2^\ell\right\|_c \le \sum_{\ell=0}^{L} \frac{1}{\mu_\ell} \left\|e^d_\ell\right\|_c = \sum_{\ell=0}^{L} \frac{1}{\mu_\ell} \sqrt{\sum_{i=1}^{n} c_i\, (e^d_{i,\ell})^2} \le \sum_{\ell=0}^{L} \frac{\sqrt{A_d}}{\mu_\ell} \max_{1 \le i \le n} \left|e^d_{i,\ell}\right|,$$

where we recall that $\sum_{i=1}^{n} c_i = A_d$, the surface area of $S^{d-1}$. Next, we can write

$$\left\|\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^k y_\ell\right\|_c^2 = \left(\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^k y_\ell\right)^\top D_c \left(\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^k y_\ell\right) = \sum_{j=0}^{L} \sum_{\ell=0}^{L} (1 - 2\eta\mu_j)^k (1 - 2\eta\mu_\ell)^k\, y_j^\top D_c\, y_\ell$$
$$= \sum_{j=0}^{L} \sum_{\ell=0}^{L} (1 - 2\eta\mu_j)^k (1 - 2\eta\mu_\ell)^k \left(\int_{S^{d-1}} g_j(\xi)\, g_\ell(\xi)\,d\xi + e^c_{j,\ell}\right) = \sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^{2k} \|g_\ell\|_{L^2}^2 + \underbrace{\sum_{j=0}^{L} \sum_{\ell=0}^{L} (1 - 2\eta\mu_j)^k (1 - 2\eta\mu_\ell)^k\, e^c_{j,\ell}}_{\varepsilon_1(k)}, \qquad (56)$$

where we used the fact that $g_j$ and $g_\ell$ are orthogonal in $L^2(S^{d-1})$ for $j \ne \ell$. Combining eq. (55) and eq. (56), we can write

$$\left\|(I - 2\eta H^\infty D_c)^k y\right\|_c = \sqrt{\sum_{j=0}^{L} (1 - 2\eta\mu_j)^{2k} \|g_j\|_{L^2}^2 + \varepsilon_1(k)} + \varepsilon_2, \qquad (57)$$

where $|\varepsilon_1(k)| \le \sum_{j=0}^{L} \sum_{\ell=0}^{L} (1 - 2\eta\mu_j)^k (1 - 2\eta\mu_\ell)^k\, |e^c_{j,\ell}|$. The result follows from eq. (54) and eq. (57).

Similarly, if we use Theorem 1′, we can write down Theorem 2 in a form in which $\varepsilon_3(k)$ depends on the parameters of the NN.

Theorem 2′. Under the same setup and assumptions as Theorem 1′, let $P = D_c$ and $M_P = \sqrt{c_{\max}}$. If $g : S^{d-1} \to \mathbb{R}$ is a bandlimited function with bandlimit $L$ and $1 - 2\eta\mu_\ell > 0$ for all $0 \le \ell \le L$ (see eq. (12)), then with probability at least $1 - \delta$ we have

$$\|y - u(k)\|_c = \sqrt{\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^{2k} \|g_\ell\|_{L^2}^2 + \varepsilon_1(k)} + \varepsilon_2 + \varepsilon_3(k), \qquad 0 \le k \le T, \qquad (58)$$

where $|\varepsilon_3(k)|$ is bounded by eq. (51), and $\varepsilon_1(k)$ and $\varepsilon_2$ satisfy

$$|\varepsilon_1(k)| \le \sum_{j=0}^{L} \sum_{\ell=0}^{L} (1 - 2\eta\mu_j)^k (1 - 2\eta\mu_\ell)^k \left|e^c_{j,\ell}\right|, \qquad |\varepsilon_2| \le \sum_{\ell=0}^{L} \frac{\sqrt{A_d}}{\mu_\ell} \max_{1 \le i \le n} \left|e^d_{i,\ell}\right|.$$

We remark that the left-hand side of eq.
(17) can be rewritten as

$$\|y - u(k)\|_c^2 = \sum_{i=1}^{n} c_i\, (y_i - u_i(k))^2 = \sum_{i=1}^{n} c_i \left(g(x_i) - N_k(x_i)\right)^2 = \|g - N_k\|_{L^2}^2 - E_c\left((g - N_k)^2\right),$$

where $N_k$ is the neural network at the $k$th iteration and $E_c$ denotes the quadrature error functional of the rule. Hence, eq. (17) can also be written as

$$\|g - N_k\|_{L^2} = \sqrt{\sum_{\ell=0}^{L} (1 - 2\eta\mu_\ell)^{2k} \|g_\ell\|_{L^2}^2 + \varepsilon_1(k)} + \varepsilon_2 + \varepsilon_3(k) + \varepsilon_4(k),$$

where $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$ are as in Theorem 2 and $|\varepsilon_4(k)| \le |E_c((g - N_k)^2)|$. Moreover, we assumed that $g$ is bandlimited and, as we see in Proposition 2, the spherical harmonic coefficients of $N_k$ decay quickly as the frequency $\ell \to \infty$. This shows that the size of $|\varepsilon_4(k)|$ can also be controlled by $\gamma_{n,\ell}$ in eq. (18). Therefore, we have derived a frequency bias statement for the generalization error with respect to the uniform distribution on $S^{d-1}$.
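The leading term of Theorem 2 says that the $\ell$th frequency component decays like $(1 - 2\eta\mu_\ell)^k$, so frequencies with smaller $\mu_\ell$ take more iterations to reach a given tolerance. A tiny sketch with a hypothetical decreasing spectrum (the values of $\mu_\ell$ and $\eta$ below are illustrative, not the kernel's actual eigenvalues):

```python
import numpy as np

# Iterations needed so that (1 - 2*eta*mu)^k <= 1e-3 for each frequency.
mu = np.array([1.0, 0.25, 0.06])   # hypothetical stand-ins for mu at three frequencies
eta = 0.4                          # 2*eta*mu < 1 for all entries
iters_to_tol = np.ceil(np.log(1e-3) / np.log(1.0 - 2.0 * eta * mu)).astype(int)
# Entries grow as mu shrinks: low frequencies (large mu) are learned first.
```

This reproduces, at the level of the model, the $O(\ell^2)$-type iteration counts observed in Figure 1, since $\mu_\ell$ decays with $\ell$ for the NTK-type kernel.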

D.2 FREQUENCY BIAS UP TO AN APPROXIMATION ERROR

Theorem 2 shows that we theoretically have frequency bias up to the level of the quadrature errors $\varepsilon_1(k)$ and $\varepsilon_2$. If the quadrature errors are large, then we may not observe frequency bias in practice. Here, we show that the quadrature errors can be made arbitrarily small by taking enough samples in the training data. Recall that our quadrature rule satisfies eq. (18). Next, we pass from the quadrature error to the approximation error, which may not necessarily be tight; however, this allows us to use the existing theory of spherical harmonic approximation to show the decay of the quadrature errors.

Lemma 5. Let $f : S^{d-1} \to \mathbb{R}$ be a function and $\mathrm{dist}(f, \Pi_\ell^d) = \min_{h \in \Pi_\ell^d} \|f - h\|_{L^\infty}$. We have

$$\left|\int_{S^{d-1}} f(\xi)\,d\xi - \sum_{i=1}^{n} c_i\, f(x_i)\right| \le 2\gamma_{n,\ell}\, \|f\|_{L^\infty} + 2 A_d\, \mathrm{dist}(f, \Pi_\ell^d) \qquad (60)$$

for any integer $\ell \ge 0$.

Proof. Let $h_\ell = \arg\min_{h \in \Pi_\ell^d} \|f - h\|_{L^\infty}$. By the triangle inequality, we have

$$\left|\int_{S^{d-1}} f(\xi)\,d\xi - \sum_{i=1}^{n} c_i\, f(x_i)\right| \le \left|\int_{S^{d-1}} f(\xi)\,d\xi - \int_{S^{d-1}} h_\ell(\xi)\,d\xi\right| + \left|\int_{S^{d-1}} h_\ell(\xi)\,d\xi - \sum_{i=1}^{n} c_i\, h_\ell(x_i)\right| + \left|\sum_{i=1}^{n} c_i \left(h_\ell(x_i) - f(x_i)\right)\right|$$
$$\le A_d\, \mathrm{dist}(f, \Pi_\ell^d) + \gamma_{n,\ell}\, \|h_\ell\|_{L^\infty} + A_d\, \mathrm{dist}(f, \Pi_\ell^d) \le 2\gamma_{n,\ell}\, \|f\|_{L^\infty} + 2 A_d\, \mathrm{dist}(f, \Pi_\ell^d),$$

where we used the facts that $c_i > 0$ for $i = 1, \ldots, n$ and that $\|h_\ell\|_{L^\infty} \le 2\|f\|_{L^\infty}$.

Next, we focus on controlling the minimum approximation error $\mathrm{dist}(f, \Pi_\ell^d)$ on the right-hand side of eq. (60). To do so, we prove the following lemma.

Lemma 6. Let $f_{ij}(\xi) = K^\infty(x_i, \xi)\, g_j(\xi)$ and $g_{jp}(\xi) = g_j(\xi)\, g_p(\xi)$. Then, for a constant $C > 0$ that depends only on $d$, we have

$$\mathrm{dist}(f_{ij}, \Pi_\ell^d) \le C\, \frac{j+1}{\ell}\, \|g_j\|_{L^\infty}, \quad 1 \le i \le n,\ 0 \le j \le L, \qquad \mathrm{dist}(g_{jp}, \Pi_\ell^d) \le C\, \frac{j+p}{\ell}\, \|g_j\|_{L^\infty} \|g_p\|_{L^\infty}, \quad 0 \le j, p \le L,$$

for all $\ell \ge 1$.

Proof. For $1 \le a < b \le d$, where $a, b \in \mathbb{N}$, and $t \in [-\pi, \pi)$, let $Q_{a,b,t}$ denote the action on $S^{d-1}$ of the rotation by angle $t$ in the $(x_a, x_b)$-plane. For an integer $\alpha \ge 1$, we define the operator on functions on $S^{d-1}$ by

$$\Delta^\alpha_{a,b,t} = \left(I - T(Q_{a,b,t})\right)^\alpha, \qquad T(Q) f(x) = f(Q^{-1} x).$$
If f ∈ C(S d-1 ), then we define for t > 0 that ω α (f ; t) = max 1≤a<b≤d sup |θ|≤t ∆ α a,b,θ f L ∞ . By (Dai & Xu, 2013, Thm. 4.4. 2), we have dist(f, Π d ℓ ) ≤ c 1 ω α (f ; ℓ -1 ), ℓ ≥ 1, α ≥ 1, where c 1 > 0 is some constant that only depends on α. Then it is sufficient to bound ω α (f ij ; ℓ -1 ) and ω α (g jp ; ℓ -1 ) to finish the proof. First, we aim to bound the term dist( f ij , Π d ℓ ) where f ij (ξ) = K ∞ (x i , ξ)g j (ξ). We fix 1 ≤ i ≤ n, 1 ≤ a < b ≤ d where i, a, b ∈ N, and choose θ such that |θ| ≤ ℓ -1 . We have ∆ 1 a,b,θ f ij L ∞ = f ij (ξ) -f ij (Q -1 a,b,θ ξ) L ∞ . We then define δK ∞ i (ξ) := K ∞ (x i , Q -1 a,b,θ ξ) -K ∞ (x i , ξ), δg j (ξ) := g j (Q -1 a,b,θ ξ) -g j (ξ). We then have f ij (ξ)-f ij (Q -1 a,b,θ ξ) L ∞ = ∥K ∞ (x i , ξ)g j (ξ)-(K ∞ (x i , ξ)+δK ∞ i (ξ))(g j (ξ)+δg j (ξ))∥ L ∞ ≤ ∥δK ∞ i (ξ)(g j (ξ) + δg j (ξ))∥ L ∞ + ∥K ∞ (x i , ξ)δg j (ξ)∥ L ∞ . (61) We control the two terms separately. First, to control the second term, we write ∥K ∞ (x i , ξ)δg j (ξ)∥ L ∞ ≤ ∥K ∞ (x i , ξ)∥ L ∞ ∥δg j (ξ)∥ L ∞ ≤ 1 2 ∥δg j (ξ)∥ L ∞ , based on the definition of K ∞ in eq. ( 11). For some constant c 2 > 0 that depends only on d, we have ∥δg j (ξ)∥ L ∞ = ∆ 1 a,b,θ g j (ξ) L ∞ ≤ c 2 ℓ -1 ∥D a,b g j (ξ)∥ L ∞ ≤ c 2 j ℓ ∥g j ∥ L ∞ , where D a,b := x a ∂ b -x b ∂ a and the two inequalities follow from (Dai & Xu, 2013, Lem. 4.2.2 (iii) ) and (Dai & Xu, 2013, Lem. 4.2.4) , respectively. Next, we control the first term of eq. ( 61) by writing ∥δK ∞ i (ξ)(g j (ξ) + δg j (ξ))∥ L ∞ ≤ ∥δK ∞ i (ξ)∥ L ∞ g j (Q -1 a,b,θ ξ) L ∞ = ∥δK ∞ i (ξ)∥ L ∞ ∥g j ∥ L ∞ . (64) We fix ξ and define ξ ′ = Q -1 a,b,θ ξ. It follows that δK ∞ i (ξ) = K ∞ (x i , ξ ′ ) -K ∞ (x i , ξ) = 1 4π (x ⊤ i ξ ′ + 1) π-arccos(x ⊤ i ξ ′ ) -(x ⊤ i ξ + 1) π-arccos(x ⊤ i ξ) = 1 4π x ⊤ i (ξ ′ -ξ) π-arccos(x ⊤ i ξ) -(x ⊤ i ξ ′ + 1) arccos(x ⊤ i ξ)-arccos(x ⊤ i ξ ′ ) Next, by the triangle inequality for angles, we have arccos(x ⊤ i ξ) -arccos(x ⊤ i ξ ′ ) ≤ |θ| ≤ ℓ -1 . 
Since arccos is monotone and |(d/dt) arccos(t)| ≥ 1 for all t, by the fundamental theorem of calculus, we must have

|x_i^⊤ ξ′ − x_i^⊤ ξ| ≤ |arccos(x_i^⊤ ξ) − arccos(x_i^⊤ ξ′)| ≤ ℓ^{-1}.

As a result, we have

|δK^∞_i(ξ)| ≤ (1/(4π))(ℓ^{-1} π + 2ℓ^{-1}) = ((π + 2)/(4π)) ℓ^{-1}.   (65)

By eq. (64) and eq. (65), we find that

‖δK^∞_i(ξ)(g_j(ξ) + δg_j(ξ))‖_{L^∞} ≤ ((π + 2)/(4π)) ℓ^{-1} ‖g_j‖_{L^∞}.   (66)

Putting eqs. (61) to (63) and (66) together, we obtain

‖Δ^1_{a,b,θ} f_{ij}‖_{L^∞} = ‖f_{ij}(ξ) − f_{ij}(Q^{-1}_{a,b,θ} ξ)‖_{L^∞} ≤ (c_2 j + (π + 2)/(4π)) ℓ^{-1} ‖g_j‖_{L^∞}.

Since a, b, and θ are chosen arbitrarily, this bound also holds for ω_1(f_{ij}; ℓ^{-1}). Second, we aim to bound the term dist(g_{jp}, Π^d_ℓ), where g_{jp}(ξ) = g_j(ξ) g_p(ξ). As before, we fix the indices a, b ∈ N with 1 ≤ a < b ≤ d and θ such that |θ| ≤ ℓ^{-1}. Similarly, we define

δg_j(ξ) = g_j(Q^{-1}_{a,b,θ} ξ) − g_j(ξ),  δg_p(ξ) = g_p(Q^{-1}_{a,b,θ} ξ) − g_p(ξ).

By eq. (63), we have

‖g_j g_p(ξ) − g_j g_p(Q^{-1}_{a,b,θ} ξ)‖_{L^∞} = ‖g_j(ξ) g_p(ξ) − (g_j(ξ) + δg_j(ξ))(g_p(ξ) + δg_p(ξ))‖_{L^∞}
  ≤ ‖δg_j(ξ)(g_p(ξ) + δg_p(ξ))‖_{L^∞} + ‖g_j(ξ) δg_p(ξ)‖_{L^∞} ≤ c_2 ((j + p)/ℓ) ‖g_j‖_{L^∞} ‖g_p‖_{L^∞}.

Since a, b, and θ are arbitrary, this bound also holds for ω_1(g_{jp}; ℓ^{-1}).

Using these lemmas, we can prove that Theorem 3 asymptotically controls the quadrature errors.

Proof of Theorem 3. First, we control ε_1(k) as

|ε_1(k)| ≤ Σ_{j=0}^L Σ_{p=0}^L (1 − 2ημ_j)^k (1 − 2ημ_p)^k |e^c_{jp}| ≤ Σ_{j=0}^L Σ_{p=0}^L |e^c_{jp}|
  ≤ Σ_{j=0}^L Σ_{p=0}^L (2γ_{n,ℓ} ‖g_{jp}‖_{L^∞} + 2A_d dist(g_{jp}, Π^d_ℓ))
  ≤ Σ_{j=0}^L Σ_{p=0}^L (2γ_{n,ℓ} ‖g_j‖_{L^∞} ‖g_p‖_{L^∞} + 2A_d C ((j + p)/ℓ) ‖g_j‖_{L^∞} ‖g_p‖_{L^∞}),

where the third and the fourth inequalities follow from Lemma 5 and Lemma 6, respectively. The final upper bound for |ε_1(k)| holds since Σ_{j=0}^L Σ_{p=0}^L (j + p) = O(L^3).
Next, there exists C 2 > 0 such that we can control ε 2 by writing |ε 2 | ≤ L j=0 √ A d µ j max 1≤i≤n e b ij ≤ L j=0 √ A d µ j 2γ n,ℓ ∥g j ∥ L ∞ + 2A d C j + 1 ℓ ∥g j ∥ L ∞ ≤ C 2   L j=0 µ -1 j   L 2 ℓ + Lγ n,ℓ max j ∥g j ∥ L ∞ , where the second and the third inequalities follow from Lemma 5 and Lemma 6, respectively. The proof is complete. Proof of Corollary 1. By Theorem 3, it suffices to show that max 0≤j≤L ∥g j ∥ L ∞ ≤ C ∥g∥ L 2 for some constant C > 0 that does not depend on g. Since H d i ⊥ H d j in L 2 for i ̸ = j, we have ∥g j ∥ L 2 ≤ ∥g∥ L 2 for 0 ≤ j ≤ L. The claim follows from the fact that ∥•∥ L 2 and ∥•∥ L ∞ are equivalent in Π d L .

E THE THEORY OF FREQUENCY BIAS WITH A H s -BASED LOSS FUNCTION

This section presents detailed proofs for Proposition 2 and Theorem 4 in Section 5, which concern the frequency bias behavior of NN training using the H^s loss function.

E.1 A 2-LAYER RELU NEURAL NETWORK IS IN H^s(S^{d-1}) FOR s < 3/2

We prove that a 2-layer ReLU NN map is contained in H^s(S^{d-1}) for any s < 3/2 (see Proposition 2).

Proof of Proposition 2. Since N can be written as N(x) = Σ_{r=1}^m a_r ReLU(w_r^⊤ x + b_r), it suffices to prove that f(x) = ReLU(w^⊤ x + b) is in H^s for all w ∈ R^d and b ∈ R. Since ReLU(ax) = a ReLU(x) for any a > 0, we can assume that ‖w‖_2 = 1. Moreover, since the Sobolev spaces are rotationally invariant, we can assume that w = (1, 0, ..., 0)^⊤. Then, f can be written as f(x) = ReLU(x_1 + b). If b ≤ −1 or b ≥ 1, then f(x) is a constant, and thus f ∈ H^s(S^{d-1}) for all s ∈ R. We therefore assume −1 < b < 1, so that

f(x) = x_1 + b if x_1 > −b,  and  f(x) = 0 if x_1 ≤ −b.

We define the function

S_s(f)(x) = ∫_0^π |I_t f(x) − f(x)|^2 / t^{2s+1} dt,  where  I_t f(x) = ⨍_{C(x,t)} f(ξ) dξ,  C(x,t) = {ξ ∈ S^{d-1} | arccos(ξ · x) ≤ t}.

Here, ⨍_{C(x,t)} f(ξ) dξ = |C(x,t)|^{-1} ∫_{C(x,t)} f(ξ) dξ is the averaged integral, where |C(x,t)| is the Lebesgue measure of C(x,t). Then, by (Barceló et al., 2020, Thm. 1.1), we have f ∈ H^s(S^{d-1}) if and only if S_s(f) is integrable. We now show that S_s(f) is integrable on both E_1 = {x ∈ S^{d-1} | x_1 > −b} and E_2 = {x ∈ S^{d-1} | x_1 < −b} if s < 3/2. First, we define the function h(x) = x_1 + b. Then, h ∈ H^s(S^{d-1}). If we can show that

S_s(h − f)(x) = ∫_0^π |I_t(h − f)(x) − (h − f)(x)|^2 / t^{2s+1} dt

is integrable on E_1, then we have

∫_{E_1} S_s(f)(x) dx = ∫_{E_1} ∫_0^π |I_t f(x) − f(x)|^2 / t^{2s+1} dt dx
  ≤ 2 ∫_{E_1} ∫_0^π (|I_t h(x) − h(x)|^2 + |I_t(h − f)(x) − (h − f)(x)|^2) / t^{2s+1} dt dx
  = 2 ∫_{E_1} S_s(h)(x) dx + 2 ∫_{E_1} S_s(h − f)(x) dx < ∞.

Assume x ∈ E_1, and let ρ be the minimum angular distance between x and any point in the set S = {ξ ∈ S^{d-1} | ξ_1 = −b}, i.e., ρ = min_{ξ ∈ S} arccos(ξ · x).
Then, for 0 < t ≤ ρ, we clearly have I_t(h − f)(x) = 0 = (h − f)(x). Assume ρ < t < π. We divide C(x,t) into two parts C_1 and C_2 up to a Lebesgue null set, where C_i = C(x,t) ∩ E_i. Then, h − f = 0 on C_1. Next, by Li (2011), we know that the measure of C_2 satisfies

|C_2| / |C(x,t)| = Θ( I((t^2 − ρ^2)/t^2; d/2, 1/2) ) = Θ( B((t^2 − ρ^2)/t^2; d/2, 1/2) ),  t → ρ^+,

where I is the regularized incomplete beta function and B is the incomplete beta function. Here and throughout the proof, the constants in the big-Θ notations are independent of ρ and t, but possibly depend on b and d, which are fixed in the proof. Moreover, we have the formula

B((t^2 − ρ^2)/t^2; d/2, 1/2) = ([(t^2 − ρ^2)/t^2]^{d/2} / (d/2)) F(d/2, 1/2, (d+2)/2; (t^2 − ρ^2)/t^2),

where F is the hypergeometric function, which converges to 1 as t → ρ^+ (Olver et al., 2010, sect. 8.17, sect. 15.2). Hence, we have

|C_2| / |C(x,t)| = Θ( (t − ρ)^{d/2} / t^{d/2} ),  t → ρ^+.

Now, by the way we defined h − f, there exist constants R_1, R_2 > 0 such that

|h − f|(ξ) ≤ R_1 (t − ρ) for ξ ∈ C_2,  and  |h − f|(ξ) ≥ R_2 (t − ρ) for ξ ∈ {ζ ∈ C_2 | min_{θ ∈ S} arccos(ζ · θ) ≥ (t − ρ)/2}.

This gives us

⨍_{C(x,t)} (h − f)(ξ) dξ = Θ( (t − ρ)^{(d+2)/2} / t^{d/2} ),  t → ρ^+.

Now, we have

S_s(h − f)(x) = ∫_ρ^π t^{−2s−1} |I_t(h − f)(x)|^2 dt = ∫_ρ^π t^{−2s−1} |⨍_{C(x,t)} (h − f)(ξ) dσ(ξ)|^2 dt = Θ( ∫_ρ^π t^{−2s−1−d} (t − ρ)^{d+2} dt ) = Θ(ρ^{2−2s} + 1).

To integrate S_s(h − f) over E_1, we first change the coordinates and integrate over S^{d-2} for a fixed ρ. The resulting integral is still in Θ(ρ^{2−2s} + 1). We then integrate over ρ, and the result follows from the fact that a function in Θ(ρ^{2−2s} + 1) is integrable near ρ = 0 if and only if s < 3/2. This proves that S_s(h − f) is integrable on E_1 if and only if s < 3/2. Note that if S_s(h − f) is not integrable on E_1, then neither is S_s(f). This proves the proposition when s ≥ 3/2.
To see that S_s(f) is integrable over E_2 when s < 3/2, we note that f can be rewritten as f = f̃ − h̃, where f̃(x) = ReLU(−x_1 − b) and h̃(x) = −x_1 − b. By the same argument, we have that S_s(f) = S_s(f̃ − h̃) is integrable on E_2, which completes the proof.

E.2 FREQUENCY BIAS WITH A SQUARED SOBOLEV NORM AS THE LOSS FUNCTION

In this section, we prove Theorem 4 on Sobolev training. Recall that we compute the H^s-based loss based on eq. (20).

Proof of Theorem 4. Fix some 0 ≤ ℓ ≤ L. We can write

H^∞ P_s y_ℓ = Σ_{j=0}^{ℓ_max} Σ_{p=1}^{N(d,j)} H^∞ ω_j a_{j,p} a_{j,p}^⊤ y_ℓ = Σ_{p=1}^{N(d,ℓ)} H^∞ ω_ℓ a_{ℓ,p} ĝ_{ℓ,p} + Σ_{j=0}^{ℓ_max} Σ_{p=1}^{N(d,j)} H^∞ ω_j a_{j,p} e^a_{ℓ,j,p},   (71)

where ω_ℓ = (1 + ℓ)^{2s} and we used the fact that

a_{j,p}^⊤ y_ℓ = ∫_{S^{d-1}} Y_{j,p}(ξ) g_ℓ(ξ) dξ + e^a_{ℓ,j,p} = ĝ_{ℓ,p} + e^a_{ℓ,j,p} if ℓ = j,  and  a_{j,p}^⊤ y_ℓ = e^a_{ℓ,j,p} otherwise.

Next, we have

(H^∞ a_{j,p})_i = ∫_{S^{d-1}} K^∞(x_i, ξ) Y_{j,p}(ξ) dξ + e^b_{i,j,p} = µ_j Y_{j,p}(x_i) + e^b_{i,j,p},

where the last equality follows from the Funk-Hecke formula. Hence, the first term in eq. (71) can be written as

( Σ_{p=1}^{N(d,ℓ)} H^∞ ω_ℓ a_{ℓ,p} ĝ_{ℓ,p} )_i = Σ_{p=1}^{N(d,ℓ)} (µ_ℓ Y_{ℓ,p}(x_i) + e^b_{i,ℓ,p}) ω_ℓ ĝ_{ℓ,p} = µ_ℓ ω_ℓ g_ℓ(x_i) + Σ_{p=1}^{N(d,ℓ)} e^b_{i,ℓ,p} ω_ℓ ĝ_{ℓ,p}.

Moreover, the second term in eq. (71) can be written as

( Σ_{j=0}^{ℓ_max} Σ_{p=1}^{N(d,j)} H^∞ ω_j a_{j,p} e^a_{ℓ,j,p} )_i = Σ_{j=0}^{ℓ_max} Σ_{p=1}^{N(d,j)} ω_j e^a_{ℓ,j,p} (µ_j Y_{j,p}(x_i) + e^b_{i,j,p}).

Therefore, we have

H^∞ P_s y_ℓ = µ_ℓ ω_ℓ y_ℓ + ω_ℓ ε^ℓ_1.   (72)

Applying eq. (72) recursively, we have

(I − 2ηH^∞ P_s)^k y = Σ_{ℓ=0}^L (1 − 2ηµ_ℓ ω_ℓ)^k y_ℓ − 2η Σ_{ℓ=0}^L Σ_{t=0}^{k-1} (1 − 2ηµ_ℓ ω_ℓ)^t (I − 2ηH^∞ P_s)^{k−t−1} ω_ℓ ε^ℓ_1,

where the second term on the right-hand side is denoted by ε_1. Now, since H^∞ P_s is self-adjoint and positive definite in (R^n, ⟨·,·⟩_{P_s}), by the way we pick η, we guarantee that I − 2ηH^∞ P_s is positive definite in (R^n, ⟨·,·⟩_{P_s}) and hence ‖I − 2ηH^∞ P_s‖_{P_s} ≤ 1 − 2ηλ_0. This gives us ‖ε_1‖_{P_s} ≤ Σ_{ℓ=0}^L µ_ℓ^{-1} ‖ε^ℓ_1‖_{P_s}, and the result follows from Theorem 1.

While Theorem 4 captures the frequency bias in squared H^s loss training up to quadrature errors, analyzing the quadrature errors can be task-specific. Therefore, studying the quadrature rules could be a direction of future research. If we use Theorem 1′ in the proof instead, we can state Theorem 4 in a form in which ε_2(k) depends on the parameters of the NN.

Theorem 4′. Suppose g ∈ Π^d_L and Φ_s is the loss function in eq.
(20), where P_s is positive definite and ℓ_max ≥ L. Under the assumptions of Theorem 1′, if 1 − 2ηµ_ℓ(1 + ℓ)^{2s} > 0 for all 0 ≤ ℓ ≤ L, then with probability ≥ 1 − δ over the random initialization, we have

y − u(k) = Σ_{ℓ=0}^L (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^k y_ℓ + ε_1 + ε_2(k),  0 ≤ k ≤ T,

where ‖ε_2(k)‖_{P_s} is bounded by eq. (51) with M_P = M_{P_s}, y_ℓ = (g_ℓ(x_1), ..., g_ℓ(x_n))^⊤, and ε_1 satisfies

‖ε_1‖_{P_s} ≤ Σ_{ℓ=0}^L µ_ℓ^{-1} ‖ε^ℓ_1‖_{P_s},  (ε^ℓ_1)_i = e^d_{i,ℓ} + Σ_{j=0}^{ℓ_max} ((1+j)^{2s}/(1+ℓ)^{2s}) Σ_{p=1}^{N(d,j)} e^a_{ℓ,j,p} (µ_j Y_{j,p}(x_i) + e^b_{i,j,p}).

Moreover, we note that the remark after the proof of Theorem 2 applies here as well. In particular, using the relation ‖y − u(k)‖_c^2 = ‖g − N_k‖_{L^2}^2 − E_c((g − N_k)^2), we can rewrite eq. (21) as

‖g − N_k‖_{L^2} = ‖ Σ_{ℓ=0}^L (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^k y_ℓ ‖_c + ε_1 + ε_2(k) + ε_3(k),

where ε_1 and ε_2(k) are as in Theorem 4 with |ε_1| ≤ ‖ε_1‖_c and |ε_2(k)| ≤ ‖ε_2(k)‖_c, and |ε_3(k)| ≤ |E_c((g − N_k)^2)|. By eq. (57), we can directly relate ‖ Σ_{ℓ=0}^L (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^k y_ℓ ‖_c to Σ_{ℓ=0}^L (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^{2k} ‖g_ℓ‖_{L^2}^2 up to some quadrature errors. Therefore, the generalization error with respect to the uniform distribution can be written as

‖g − N_k‖_{L^2} = ( Σ_{ℓ=0}^L (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^{2k} ‖g_ℓ‖_{L^2}^2 )^{1/2} + E_NTK + E_quad,

where E_NTK is an error that can be made arbitrarily small by taking κ small enough and m large enough, and E_quad involves only quadrature errors and can be made arbitrarily small by making the quadrature rule accurate enough.
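To make the role of the weight ω_ℓ = (1 + ℓ)^{2s} concrete, the following sketch computes the per-frequency contraction factors (1 − 2ηµ_ℓ(1 + ℓ)^{2s})^k from Theorem 4 for a toy spectrum. The polynomial eigenvalue decay µ_ℓ = (1 + ℓ)^{-3}, the step sizes, and the choice s = 1.5 are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative NTK eigenvalues: mu_ell is known to decay polynomially in ell;
# we take mu_ell = (1+ell)^-3 as a stand-in.
ells = np.arange(0, 10)
mu = (1.0 + ells) ** -3.0

def contraction(s, eta, k):
    """Per-frequency residual factor (1 - 2*eta*mu_ell*(1+ell)^(2s))^k."""
    omega = (1.0 + ells) ** (2.0 * s)
    rho = 1.0 - 2.0 * eta * mu * omega
    assert np.all(rho > 0), "step size too large for this s"
    return rho ** k

k = 200
fac_low = contraction(s=0.0, eta=0.1, k=k)    # plain L2 loss: low-frequency bias
fac_high = contraction(s=1.5, eta=1e-4, k=k)  # H^s loss with s > 0: bias dampened

# With s = 0 the residual at ell = 9 shrinks far slower than at ell = 0;
# here s = 1.5 exactly cancels the assumed mu_ell decay, equalizing all rates.
```

With s = 0 the factors span many orders of magnitude across ℓ, reproducing the usual low-frequency bias; increasing s narrows, and can reverse, that gap.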

F EXPERIMENTAL DETAILS

We now present the details of the three experiments in Section 6.

LEARNING TRIGONOMETRIC POLYNOMIALS ON THE UNIT CIRCLE

In section 6.1, we train an NN with data derived from sampling a trigonometric polynomial at nonuniform points on the unit circle. The n = 1140 nonuniform data points {x_i} for this test are generated by taking the union of three sets of equally spaced points, as shown in Figure 1. The data set {(cos(θ_i), sin(θ_i))} contains 100 equally spaced nodes sampled from θ ∈ (0, 2π], superimposed with 40 equally spaced nodes sampled from θ ∈ [0, 0.3π] and 1000 equally spaced nodes sampled from θ ∈ [1.4π, 1.8π] (see Figure 1, left). We construct the quadrature weights c_j by minimizing Σ_{j=1}^n c_j^2 under the constraints that the c_j's are positive and the quadrature rule is exact on Π²_{55}, i.e., E_c(f) = 0 for all f ∈ Π²_{55}. Note that, due to the linearity of the quadrature rule, it suffices to check a finite set of linear constraints E_c(Y_{ℓ,p}) = 0 for 0 ≤ ℓ ≤ 55 and 1 ≤ p ≤ N(2,ℓ). This computational method is proposed in (Mhaskar et al., 2000). Here, 55 is selected to be an integer close to the maximum degree ℓ such that there exists a positive quadrature rule exact on Π^d_ℓ. The upper bounds on the quadrature errors in Theorem 3 contain the terms L³/ℓ + L²γ_{n,ℓ} and L²/ℓ + Lγ_{n,ℓ}, where L is a constant bandlimit of the target function. By requiring the quadrature rule to be exact on Π²_{ℓ₀}, we guarantee that γ_{n,ℓ₀} = 0. Hence, by requiring ℓ₀ to be large, we can heuristically make γ_{n,ℓ} smaller for a moderate ℓ. This allows us to show that the upper bounds in eq. (76) are small by balancing the term that involves ℓ^{-1}, which vanishes as ℓ → ∞, against the term that involves γ_{n,ℓ}, which increases as ℓ → ∞. This justifies why we compute the quadrature rule by requiring it to be exact on spherical harmonics of degree as large as possible.

The experiment consists of two parts. First, we compare the effects of training with the loss function Φ based on eq. (1) against the squared L² loss function Φ̃ in eq. (3), where

Φ̃(W) = (A_d/(2n)) Σ_{i=1}^n |N(x_i) − g(x_i)|²,  Φ(W) = (1/2) Σ_{i=1}^n c_i |N(x_i) − g(x_i)|².

We define the target function to be

g(x) = g̃(θ) = Σ_{ℓ=1}^9 sin(ℓθ),  x = x(θ) = (cos θ, sin θ) ∈ S¹,

where g̃(θ) = g(x(θ)). We set up two 2-layer ReLU-activated NNs with 5 × 10⁴ hidden neurons in each layer and train them using the same training data and gradient descent procedure, except with the different loss functions Φ̃ and Φ. In Figure 5, we show the evolution of the NNs trained using Φ̃ and Φ, respectively. While the NN trained with Φ̃ approximates the function very well in the region where the training data are dense (i.e., where θ ∈ [1.4π, 1.8π] ≈ [4.40, 5.65]), the NN trained with Φ provides a better overall approximation on the entire domain and demonstrates frequency bias much more clearly. To evaluate the frequency loss, we collect 100 uniform samples of N(x) and g(x) and compute the Fourier coefficients N̂(ℓ) and ĝ(ℓ) such that

N(x) = Ñ(θ) ≈ Σ_{ℓ=0}^{30} N̂(ℓ) e^{iℓθ},  g(x) = g̃(θ) ≈ Σ_{ℓ=0}^{30} ĝ(ℓ) e^{iℓθ},

where Ñ(θ) = N(x(θ)). The frequency loss |N̂(ℓ) − ĝ(ℓ)| estimates how well g(x) is approximated by N(x) at the frequency ℓ when training with the different loss functions (see Figure 1, middle). In addition, we also train the NN to learn each individual frequency, with the training data coming from g_ℓ = sin(ℓθ), and count the number of iterations it takes to obtain Φ(W) < 1.0 × 10⁻³ (see Figure 1, right).

The second part of the experiment focuses on NN training with a discretized squared Sobolev norm as the loss function. More precisely, we fix some s ∈ R and consider the Sobolev loss function in eq. (20). For S¹, we have N(d,ℓ) = 2 if ℓ > 0. We take Y_{ℓ,1}(x) = sin(ℓθ)/√(2π) and Y_{ℓ,2}(x) = cos(ℓθ)/√(2π), where the √(2π) factor is a normalization factor. We set ℓ_max = 30 in eq. (20) and learn g with different values of s ranging from −1 to 4. For each s, we compute the frequency loss |N̂(ℓ) − ĝ(ℓ)| after different numbers of epochs.
As s increases, the frequency loss for higher frequencies decays faster (see Figure 6 ). In particular, we see in Figure 2 , the "rainbow" plot, that when s = -1, the lower-frequency losses are much smaller than the higher ones after 5000 iterations, while when s = 3 the higher frequencies are learned faster than the lower ones.
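The frequency-loss diagnostic above can be sketched as follows. The "trained" network here is a hypothetical stand-in that has learned only frequencies 1 through 5 of the target g(θ) = Σ_{ℓ=1}^9 sin(ℓθ), so the loss concentrates on the unlearned frequencies ℓ = 6, ..., 9.

```python
import numpy as np

# 100 uniform samples on the circle, matching the diagnostic in the text.
theta = 2 * np.pi * np.arange(100) / 100

def g(t):
    # Target: sum of sin(ell*theta) for ell = 1, ..., 9.
    return sum(np.sin(ell * t) for ell in range(1, 10))

def N_approx(t):
    # Hypothetical stand-in for the trained NN: frequencies 1-5 only.
    return sum(np.sin(ell * t) for ell in range(1, 6))

def fourier_coeffs(samples, max_ell=30):
    # c[ell] approximates the coefficient of e^{i*ell*theta}.
    c = np.fft.fft(samples) / len(samples)
    return c[: max_ell + 1]

freq_loss = np.abs(fourier_coeffs(N_approx(theta)) - fourier_coeffs(g(theta)))
# freq_loss is near zero except at the unlearned frequencies ell = 6, ..., 9.
```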

F.2 LEARNING SPHERICAL HARMONICS ON THE UNIT SPHERE

In section 6.2, we train an NN with data derived from sampling a function defined on S² at nonuniform points. The training data is a maximum determinant set of 2500 points that comes from the so-called "spherepts" dataset (Wright & Michaels, 2015). In this experiment, the target function is g(x) = Σ_{ℓ=1}^{15} Y_{2ℓ,0}(x), where Y_{2ℓ,0} is the normalized zonal spherical harmonic of degree 2ℓ. Therefore, the spherical harmonic coefficients of g defined in eq. (4) satisfy ĝ_{ℓ,p} = 1 if p = 0 and ℓ = 2, 4, ..., 30, and ĝ_{ℓ,p} = 0 otherwise. We then train an NN with the squared H^s norm as the loss function (see eq. (20)) for s = −1, 0, 2.5. By Theorem 4, we need ℓ_max ≥ L = 30. We set ℓ_max = 40 in eq. (20), assuming that the bandwidth L is not known a priori. We observe frequency bias in this experiment by considering |N̂_{ℓ,p} − ĝ_{ℓ,p}| after each epoch for ℓ = 4, 10, 20. We confirm that low frequencies of g are captured earlier in training than high frequencies when s = −1 and s = 0 (see Figure 3, left and middle), and that this frequency bias phenomenon can be counterbalanced by taking s = 2.5 (see Figure 3, right).

F.3 TEST ON AUTOENCODER

Autoencoders can be used as a generative model to randomly generate new data that is similar to the training data. In this experiment, we use Sobolev-based norms to improve NN training for producing new images of digits that match the MNIST dataset. In our final experiment, we use the same autoencoder architecture as in (Chollet, 2016) , except we train the autoencoder with a different loss function (see section 6.3). Here, for training images {x i }, a standard loss function can be Φ(W) = 1 2 ˆdist(N (x), x)dµ(x) ≈ 1 2n n i=1 dist(N (x i ), x i ), where µ is the distribution of the training images and dist(N (x i ), x i ) is a distance between the output of the NN given by N (x) and the image x. We select the distance metric to measure the difference between N (x) and x as Φ(W) = 1 2n n i=1 ∥N (x i ) -x i ∥ 2 F , where ∥ • ∥ F denotes the matrix Frobenius norm. The distance metric in eq. ( 77) can be viewed as a discretization of the continuous L 2 norm. That is, if one imagines generating a continuous function x : [0, 1] 2 → [0, ∞) that interpolates an image as well as a function that interpolates the NN, then 1 N pixel ∥N (x) -x∥ 2 F ≈ ¨|N (x)(y 1 , y 2 ) -x(y 1 , y 2 )| 2 dy 1 dy 2 = ∥N (x) -x∥ 2 L 2 , where N pixel is the total number of pixels of the image x. In this continuous viewpoint, the H s norm is given by (assuming that the continuous interpolating functions x and N (x) are constructed with periodic boundary conditions) ∥N (x) -x∥ 2 H s = ˆ(1 + |ξ| 2 ) s | N (x)(ξ) -x(ξ)| 2 dξ ≈ ∥S s • F l (N (x) -x) F ⊤ r ∥ 2 F (78) where F l , F r are the left and right 2D-DFT matrices, respectively, (S s ) jℓ = (1 + j 2 + ℓ 2 ) s/2 , and '•' is the Hadamard product. Hence, if we define vec(A) to be the vector obtained by reshaping a matrix A using the column-major order, then we have ∥S s • F l (N (x) -x) F ⊤ r ∥ 2 F = ∥diag(vec(S s ))(F r ⊗ F l )vec(N (x) -x)∥ 2 2 , where ⊗ is the Kronecker product of two matrices. 
Setting J_s = diag(vec(S_s))(F_r ⊗ F_l), the loss function in our NN training can be written as

Φ_s(W) = (1/(2n)) Σ_{i=1}^n ‖J_s vec(N(x_i) − x_i)‖₂² = (1/(2n)) (u − y)^⊤ (I ⊗ J_s^⊤ J_s)(u − y),   (79)

where I is the n-by-n identity matrix and u, y are the vectors of length n × N_pixel given by u = (vec(N(x_1))^⊤, ..., vec(N(x_n))^⊤)^⊤ and y = (vec(x_1)^⊤, ..., vec(x_n)^⊤)^⊤. Hence, the discrete NTK matrix is given by n^{-1} H^∞ (I ⊗ J_s^⊤ J_s), where the (i,j)th sub-block H^∞_{ij} has (i′,j′)th entry ⟨∂[vec(N(x_i; W))_{i′}]/∂W, ∂[vec(N(x_j; W))_{j′}]/∂W⟩. This means that the frequency bias behavior during NN training is directly affected by the choice of s. Alternatively, one can consider this problem more abstractly from an operator learning perspective, which is presented in appendix G. We remark that while eq. (79) is a mathematically equivalent expression for the loss function that allows us to easily express the NTK, in practice, we implement the loss function based on eq. (78) using a 2D FFT routine. We use the same autoencoder architecture as in (Chollet, 2016), except with the loss function in eq. (79) for s = −1, 0, 1. We train the autoencoder using mini-batch gradient descent with batch size equal to 256. We first pollute the training images with low-frequency noise and train the NN, hoping that the trained NN will act as a filter for the noise. We see that training the NN with eq. (79) for s = 1 gives the best results due to the high-frequency bias induced by the choice of the loss function. Although H^∞ is biased toward low frequencies, the high-frequency bias of J_s^⊤ J_s dominates for sufficiently large s. In that case, the low-frequency noise barely changes the training in the earlier epochs, as the low-frequency components of the residual correspond to small eigenvalues of H^∞ (I ⊗ J_s^⊤ J_s).
Similar results are discussed in (Engquist et al., 2020) and (Zhu et al., 2021) in the inverse problem and image processing contexts, respectively. The opposite phenomenon occurs when we add high-frequency noise (see Figure 4, bottom row). Since H^∞ by itself biases the NN training procedure toward low frequencies, the output for s = 0 already filters high-frequency noise. Since J_s^⊤ J_s for s < 0 further biases the training toward low frequencies, one can obtain even better high-frequency filters. We observe that the best denoising results for the autoencoder come from selecting s = −1 (see Figure 4, bottom row).
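The H^s-weighted image loss of eq. (78) can be sketched with a 2D FFT as follows. The square image size, frequency grid, normalization, and the synthetic high-frequency perturbation are illustrative assumptions rather than the exact training code.

```python
import numpy as np

def sobolev_loss(pred, target, s):
    """H^s-weighted squared loss of eq. (78), assuming square periodic images."""
    n = pred.shape[0]
    residual_hat = np.fft.fft2(pred - target) / n      # 2D DFT of the residual
    freqs = np.fft.fftfreq(n) * n                      # signed integer frequencies
    j, l = np.meshgrid(freqs, freqs, indexing="ij")
    weight = (1.0 + j**2 + l**2) ** s                  # (1 + |xi|^2)^s multiplier
    return 0.5 * np.sum(weight * np.abs(residual_hat) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))                      # stand-in "image"
# Pure frequency-13 perturbation along one axis (high-frequency noise):
noise_hi = np.cos(13 * 2 * np.pi * np.arange(28) / 28)[None, :]

l2 = sobolev_loss(x + noise_hi, x, s=0.0)   # unweighted L2 loss of the noise
hs = sobolev_loss(x + noise_hi, x, s=1.0)   # s > 0 amplifies the same residual
```

Since the residual lives entirely at frequency (0, ±13), the s = 1 loss scales the s = 0 loss by exactly 1 + 13² = 170, illustrating how s > 0 penalizes high-frequency mismatch more heavily.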

G NTK AND FREQUENCY BIAS IN OPERATOR LEARNING

We saw how using H^s-based losses can help us tune frequency bias in training the autoencoder (see section 6.3). In fact, in training the autoencoder, we are learning the identity operator on the space of images. In this section, we briefly discuss the NTK associated with operator learning and its consequences. Let L be a linear operator on L²(S^{d-1}). Let D = {f_1, ..., f_N} be a finite subset of L²(S^{d-1}), and let x_1, ..., x_M be distinct "samplers" in S^{d-1}. Since L is linear, without loss of generality, we assume ‖(f_i(x_1), ..., f_i(x_M))^⊤‖₂ = 1 for i = 1, ..., N. Given a function f, we use f(x) to denote the vector (f(x_1), ..., f(x_M))^⊤. We consider a fully-connected two-layer ReLU NN, N = (N_1, ..., N_M), that takes M inputs and produces M outputs. The goal of training is to learn the linear operator L on D. That is, we want N_j(f(x_1), ..., f(x_M)) ≈ (Lf)(x_j). In other words, given the samples f(x) of a function f, we want the NN to output the values of Lf at x_1, ..., x_M, i.e., (Lf)(x). As before, we let P be an M × M symmetric positive definite matrix that measures the distance between two L² functions given their samples at x_1, ..., x_M. That is, D(f(x), g(x)) = (f(x) − g(x))^⊤ P (f(x) − g(x)). We then consider a loss function of the operator NN given by

Φ(W) = (1/N) Σ_{i=1}^N D((Lf_i)(x), N(f_i)),   (80)

where N(f_i) := (N_1(f_i), ..., N_M(f_i))^⊤, in which N_j(f_i) denotes the jth output of the NN when we input f_i; that is, N_j(f_i) = N_j(f_i(x)). Let w_{i,j} be the ith weight in the jth neuron of the first hidden layer, where 1 ≤ i ≤ M and 1 ≤ j ≤ m and m is the number of hidden neurons. Let b_j be the bias term of the jth neuron in the hidden layer. Let w^(2)_{i,j} be the ith weight in the jth neuron of the output layer, so 1 ≤ i ≤ m and 1 ≤ j ≤ M. We assume the same initialization scheme.
That is, the w^(1)_{i,j} are initialized from iid Gaussians and the b_j are initialized to 0, and they are updated during training, whereas the w^(2)_{i,j} are initialized from iid Rademacher random variables and are not updated during training. In the derivation of the NTK below, we let W be the vector of all trainable weights and biases, w^(1)_{i,j} and b_j, enumerated in an (arbitrary) fixed order. This allows us to write W = (w_1, ..., w_K)^⊤, where K = m(M + 1). We assume that the gradient flow algorithm is used to train the NN, i.e., dW/dt = −∂Φ/∂W. We define the vector of labels by y_i = (Lf_i(x_1), ..., Lf_i(x_M))^⊤ and y = (y_1^⊤, ..., y_N^⊤)^⊤. Similarly, we define the vector of the NN outputs by u_i(W) = (N_1(f_i; W), ..., N_M(f_i; W))^⊤ and u = (u_1^⊤, ..., u_N^⊤)^⊤. Our goal is to understand d(y − u)/dt and write it in terms of y − u. To this end, we first consider

d(y_i − u_i(W))/dt = −du_i(W)/dt = −(du_i(W)/dW)(dW/dt) = (du_i(W)/dW)(∂Φ(W)/∂W).   (81)

Denote by w_k the kth entry of W, where 1 ≤ k ≤ K. Then, by the chain rule, the kth entry of

∂Φ(W)

∂W can be written as ∂Φ(W) ∂W k = ∂Φ(W) ∂w k = 1 N N i ′ =1 ∂ ∂w k D ((Lf i ′ )(x), N (f i ′ )(x)) = 1 N N i ′ =1 ∂D ((Lf i ′ )(x), N (f i ′ )(x)) ∂N (f i ′ ; W) ∂N (f i ′ ; W) ∂w k = - 1 N N i ′ =1 ∂N1(f i ′ ) ∂w k • • • ∂N M (f i ′ ) ∂w k P    Lf i ′ (x 1 ) -N 1 (f i ′ ) . . . Lf i ′ (x M ) -N M (f i ′ )    . ∂Φ(W) ∂W = - 1 N N i ′ =1     ∂N1(f i ′ ) ∂w1 • • • ∂N M (f i ′ ) ∂w1 . . . . . . . . . ∂N1(f i ′ ) ∂w K • • • ∂N M (f i ′ ) ∂w K     P    Lf i ′ (x 1 ) -N 1 (f i ′ ) . . . Lf i ′ (x M ) -N M (f i ′ )    . Now, combining eq. ( 81) with eq. ( 82), we have - du i (W) dt = - 1 N     ∂N1(fi) ∂w1 • • • ∂N1(fi) ∂w K . . . . . . . . . ∂N M (fi) ∂w1 • • • ∂N M (fi) ∂w K     ×          N i ′ =1     ∂N1(f i ′ ) ∂w1 • • • ∂N M (f i ′ ) ∂w1 . . . . . . . . . ∂N1(f i ′ ) ∂w K • • • ∂N M (f i ′ ) ∂w K     J i ′ P    Lf i ′ (x 1 ) -N 1 (f i ′ ) . . . Lf i ′ (x M ) -N M (f i ′ )             , which gives us d(y -u(W)) dt = - du(W) dt = - 1 N      J ⊤ 1 J 1 • • • J ⊤ 1 J N . . . . . . . . . J ⊤ N J 1 • • • J ⊤ N J N      (I N ⊗ P)(y -u). The derivation so far does not rely on the architecture of the NN. In fact, it holds for the loss function Φ defined in eq. ( 80) with any abstract function N (f i ; W). Now, we exploit the NN architecture to study J ⊤ i J i ′ . Consider the (j, j ′ )th entry of J ⊤ i J i ′ . We have (J ⊤ i J i ′ ) (j,j ′ ) = 1 m m k=1 w k,j w (2) k,j ′ f i (x) ⊤ f i ′ (x) + 1 2 1 {fi(x) ⊤ w (1) k +b (1) k ≥0,f i ′ (x) ⊤ w (1) k +b (1) k ≥0} , where w (1) k = (w 1,k , . . . , w M,k ) is the collection of weights on the kth hidden neuron. Now, note that w (2) k,j w (2) k,j ′ = 1 if j = j ′ and is a Rademacher random variable when j ̸ = j ′ . Hence, by the way we initialize w (1) i,j , we have J ⊤ i J i ′ m→∞ ----→ (f i (x) ⊤ f i ′ (x) + 1)(π -arccos(f i (x) ⊤ f i ′ (x))) 4π I M , where the convergence is entry-wise. 
Now, suppose we define the N × N matrix H ∞ as H ∞ i,i ′ = (f i (x) ⊤ f i ′ (x) + 1)(π -arccos(f i (x) ⊤ f i ′ (x))) 4π . ( ) That is, we define H ∞ as in the function learning case, except that the inputs are {f i (x)} N i=1 instead of {x i } n i=1 . By eq. ( 85), we have      J ⊤ 1 J 1 • • • J ⊤ 1 J N . . . . . . . . . J ⊤ N J 1 • • • J ⊤ N J N      m→∞ ----→ H ∞ ⊗ I M , where the convergence is entrywise. Hence, if we discretize the gradient flow algorithm and apply gradient descent with step size η, then by eq. ( 83) and eq. ( 87), we expect that y i -u i (k) ≈ (y i -u i (k -1)) - η N N i ′ =1 H ∞ i,i ′ P(y i ′ -u i ′ (k -1)), or otherwise written more compactly, y -u(k) ≈ I M N - η N (H ∞ ⊗ I M )(I N ⊗ P) k y (89) if the variance in the initialization is small enough. Given this characterization of the residual equation 89, we can see how the usage of the Sobolev loss plays a role in tuning frequency bias. Suppose we decompose the residual vector into y i -u i (k) = ∞ ℓ=0 N (d,ℓ) p=1 γℓ,p i Y ℓ,p , where Y ℓ,p = (Y ℓ,p (x 1 ), . . . , Y ℓ,p (x M )) ⊤ is the evaluation of the spherical harmonic on S d-1 . Then, γℓ,p i measures the amount of frequency loss associated with the (ℓ, p)th frequency in learning the function f i . Let γℓ,p (k) = (γ ℓ,p i (k), . . . , γℓ,p N (k)) ⊤ . The size of γℓ,p (k) measures the amount of overall frequency-ℓ components (in the pth direction) in the residuals after the kth iteration. Assume ω ℓ is an eigenvalue of P associated with the eigenspace that is approximately spanned by {Y ℓ,p } N (d,ℓ) p=1 . Note that this is the case when P = P s is associated with the squared-H s loss on S d-1 . Then, by eq. ( 88), we have γℓ,p (k) ≈ I N - η N ω ℓ H ∞ k ŷℓ,p , where ŷℓ,p = (ŷ ℓ,p 1 , . . . , ŷℓ,p N ) is defined by the decomposition y i = ∞ ℓ=0 N (d,ℓ) p=1 ŷℓ,p i Y ℓ,p . 
Equation (90) demonstrates that if ω_ℓ is relatively large for larger ℓ (e.g., when we use the squared-H^s loss with a large s > 0), then γ^{ℓ,p}(k) decays much faster for high frequencies than for low ones, and vice versa. This connects frequency bias in operator learning to the spectral properties of P_s. We studied and justified this phenomenon in the autoencoder experiment in section 6.3.
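The residual recursion of eq. (90) can be explored with a toy numerical sketch: with the eigenvalue ω_ℓ of P attached to a frequency's eigenspace, the residual in that eigenspace contracts at a rate governed by ω_ℓ H^∞. The SPD matrix below is a synthetic stand-in for H^∞, and all dimensions and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
A = rng.standard_normal((N, N))
H = A @ A.T / N + np.eye(N)    # synthetic SPD surrogate for the operator NTK H^infty

def residual_norm(omega, eta, k, y0):
    """Norm of (I - (eta/N)*omega*H)^k y0, mimicking eq. (90)."""
    M = np.eye(N) - (eta / N) * omega * H
    r = y0.copy()
    for _ in range(k):
        r = M @ r
    return np.linalg.norm(r)

y0 = rng.standard_normal(N)
eta = 0.05
low = residual_norm(omega=1.0, eta=eta, k=50, y0=y0)    # small omega_ell
high = residual_norm(omega=10.0, eta=eta, k=50, y0=y0)  # large omega_ell

# A larger omega_ell (e.g. from the H^s loss with s > 0 at high ell)
# contracts the corresponding residual component faster.
```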

H COMPUTATION OF QUADRATURE WEIGHTS

In practice, the training dataset usually does not come with a carefully designed quadrature rule. Hence, we inevitably need to compute a set of quadrature weights before training the NN. In this section, we briefly discuss methods for computing positive quadrature weights. Given a set of points {x_i}_{i=1}^n on S^{d-1}, we wish to construct a quadrature rule so that

I_n(f) := Σ_{i=1}^n c_i f(x_i) ≈ ∫_{S^{d-1}} f(x) dx

for sufficiently smooth f, where the c_i > 0 are positive quadrature weights. One approach that could give us a very accurate quadrature rule is to guarantee that

I_n(f) = ∫_{S^{d-1}} f(x) dx,  f ∈ Π^d_ℓ,   (91)

where dim(Π^d_ℓ) ≤ n. The one-dimensional case of such quadrature rules was studied in (Austin & Trefethen, 2017; Yu & Townsend, 2022), and the general higher-dimensional case was analyzed in (Mhaskar et al., 2000; Dai & Xu, 2013). Given an arbitrary dataset {x_i}_{i=1}^n and an ℓ with dim(Π^d_ℓ) ≤ n, one cannot guarantee the existence of a positive quadrature rule satisfying eq. (91) (Mhaskar et al., 2000; Dai & Xu, 2013), even when the distribution of {x_i}_{i=1}^n is very regular (Yu & Townsend, 2022). On the other hand, by choosing ℓ to be small enough, we can eventually find an ℓ for which eq. (91) holds. When such a positive quadrature rule exists, Mhaskar et al. (2000) proposed computing the weights by solving a quadratic program. While eq. (91) gives us a guarantee on the accuracy of the quadrature rule (provided ℓ is not too small), it is not always practical to compute the quadrature weights in this way. Indeed, if d is large, then we need a tremendous number of points to guarantee that dim(Π^d_ℓ) ≤ n even for a small ℓ. Also, if we have too many data points, then the quadratic program can become infeasible to solve. Hence, we need other methods for computing quadrature weights that, albeit less accurate, can be applied more cheaply to general datasets. One of the many possible approaches is kernel density estimation (Rosenblatt, 1956; Parzen, 1962).
To do so, we fix a positive kernel K(x, y) = K̃(arccos(x^⊤ y)) defined on S^{d-1} × S^{d-1}, where K̃ is a univariate kernel profile. A common choice of K̃ is the Gaussian density function with standard deviation 1 centered at 0. For each h > 0, we then define a function p_h on S^{d-1} by

p_h(x) = Σ_{i=1}^n K̃(arccos(x^⊤ x_i)/h).

The bandwidth h is a hyperparameter, and with an appropriate h, the function p_h is an (unnormalized) estimate of the density function of the distribution of the nodes. Hence, by setting

c_i = A_d p_h^{-1}(x_i) / Σ_{j=1}^n p_h^{-1}(x_j),

we obtain a positive quadrature rule that approximates the integral of smooth functions on S^{d-1}.
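The kernel-density-estimation recipe above can be sketched as follows on S² (so A_d = 4π). The Gaussian profile, the bandwidth h = 0.25, and the random test point set are illustrative choices; by construction the weights are positive and sum to A_d, so the rule integrates constants exactly.

```python
import math
import numpy as np

def kde_quadrature_weights(X, h=0.25):
    """X: (n, d) array of unit vectors on S^{d-1}; returns positive weights c_i."""
    G = np.clip(X @ X.T, -1.0, 1.0)                      # pairwise x_i^T x_j
    angles = np.arccos(G)
    p = np.exp(-0.5 * (angles / h) ** 2).sum(axis=1)     # unnormalized density p_h(x_i)
    inv = 1.0 / p                                        # weights ~ 1 / density
    d = X.shape[1]
    A_d = 2 * np.pi ** (d / 2) / math.gamma(d / 2)       # surface area of S^{d-1}
    return A_d * inv / inv.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # points on S^2
c = kde_quadrature_weights(X)
# c > 0 and sum(c) = 4*pi, so the rule is exact for f == 1 on S^2.
```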



Footnotes.
1. In the case where we do not have a good quadrature rule associated with {x_i}_{i=1}^n, we usually have a very limited understanding of the spectral properties of the target function given its values at {x_i}_{i=1}^n. Hence, we assume the existence of a good quadrature rule for theoretical purposes.
2. Otherwise, if such an i_2 does not exist, we can add an element ϕ_{n+1} associated with −x_{i_1} to {ϕ_i}_{i=1}^n. If we can show that {ϕ_i}_{i=1}^{n+1} is linearly independent, then so is {ϕ_i}_{i=1}^n.
3. For two functions α(t) and β(t), we say α(t) = Θ(β(t)) as t → ρ^+ if there exist positive constants C_l, C_r and a radius r > 0 such that C_l β(t) ≤ α(t) ≤ C_r β(t) for all 0 < t − ρ < r.



Figure 1: Left: the nonuniform data x = (cos θ, sin θ) on the unit circle S^1. Middle: the frequency losses for ℓ = 1 (blue), 5 (red), and 9 (yellow) against the number of iterations for the loss function Φ (solid lines) and Φ̃ (dashed lines). Right: the number of iterations for the NN training to achieve a fixed loss threshold when learning g_ℓ(x) = sin(ℓθ) for 3 ≤ ℓ ≤ 10 given the loss function Φ. The black line represents the O(ℓ^2) rate based on the analysis in (Basri et al., 2019).

Figure 2: Frequency bias during NN training with a squared H^s loss function. The blue-to-red rainbow corresponds to low-to-high frequency losses |N̂(ℓ) - ĝ(ℓ)| for the frequency index ℓ ranging from 1 (blue) to 9 (red), with nonuniform training data. Here, an overparameterized 2-layer ReLU NN N(x) is trained for 5000 epochs to learn a function g(x) given the H^s loss with -1 ≤ s ≤ 4. The inversion of the rainbow demonstrates the reversal of frequency bias.

Figure 3: Frequency losses for ℓ = 4 (blue), 10 (red), and 20 (yellow) when learning a function on S^2 using different squared H^s loss functions. Compared with the low-frequency bias in the cases s = -1 (left) and s = 0 (middle), we observe a high-frequency bias when s = 2.5 (right).

Figure 4: Outputs of an autoencoder trained with a squared H^s loss on a typical test image when the input images are contaminated by low-frequency noise (top row) and high-frequency noise (bottom row).

(52) where '•' is the Hadamard product, u_j(k) = [u_j(x_1; W(k)), … , u_j(x_n; W(k))]^⊤, and z_j(k) = [z_j(x_1; W(k)), … , z_j(x_n; W(k))]^⊤. Suppose we define the vectorized residual z = [z_1^⊤, … , z_p^⊤]^⊤ and define the np × np block matrix J(k) by J(k)_{(in+1):(i+1)n, (jn+1):(j+1)n} =



Figure 5: The target function (black, dotted), the NN trained with Φ (blue, solid), and the NN trained with Φ̃ (red, solid). The purple bars on the horizontal axes show the positions of the training data.

Figure 6: Frequency losses for ℓ = 3 (blue), ℓ = 5 (red), and ℓ = 9 (yellow) based on the squared H^s norm as the loss function. Left: s = -1; middle: s = 0; right: s = 1. The error bars are generated from thirty executions with different random seeds. In the left figure, the result of one of the thirty executions is omitted because the NN is trapped at a local minimizer that is not a global one, so it does not converge to the target function.


c_i > 0 for all 1 ≤ i ≤ n, and ∑_{i=1}^n c_i Y_{j,p}(x_i) = ∫_{S^{d-1}} Y_{j,p}(x) dx for all 0 ≤ j ≤ ℓ, 1 ≤ p ≤ N(d, j).

ACKNOWLEDGMENTS

We thank Aparna Gupte and Liu Zhang for discussions and some initial experiments. The work was supported by National Science Foundation grants DMS-1913129, DMS-1952757, and DMS-2045646, as well as the Simons Foundation. Y. Yang acknowledges support from Dr. Max Rössler, the Walter Haefner Foundation, and the ETH Zürich Foundation. We also thank the anonymous referees for their constructive comments, suggestions, and discussions.

