TUNING FREQUENCY BIAS IN NEURAL NETWORK TRAINING WITH NONUNIFORM DATA

Abstract

Small generalization errors of over-parameterized neural networks (NNs) can be partially explained by the frequency bias phenomenon, where gradient-based algorithms minimize the low-frequency misfit before reducing the high-frequency residuals. Using the Neural Tangent Kernel (NTK), one can provide a theoretical analysis for training where data are drawn from constant or piecewise-constant probability densities. Since most training data sets are not drawn from such distributions, we use the NTK model and a data-dependent quadrature rule to theoretically quantify the frequency bias of NN training given nonuniform data. By replacing the loss function with a selected Sobolev norm, we can amplify or dampen the intrinsic frequency bias in NN training.

1. INTRODUCTION

Neural networks (NNs) are often trained in supervised learning on a small data set, yet they are observed to provide accurate predictions on a large number of test examples not seen during training. A mystery is how training can achieve small generalization errors in an overparameterized NN, together with the so-called "double-descent" risk curve (Belkin et al., 2019). In recent years, a potential answer has emerged called "frequency bias": the phenomenon that in the early epochs of training, an overparameterized NN finds a low-frequency fit of the training data, while higher frequencies are learned in later epochs (Rahaman et al., 2019; Yang & Salman, 2019; Xu, 2020). In addition to generalization errors, it is often useful to understand the convergence rate of each spectral component of the data mismatch in order to study the robustness of the NN under noise. Currently, frequency bias is theoretically understood via the Neural Tangent Kernel (NTK) (Jacot et al., 2018) for uniform training data (Arora et al., 2019; Cao et al., 2019; Basri et al., 2019) and for data distributed according to a piecewise-constant probability measure (Basri et al., 2020). However, most training data sets in practice are highly clustered and far from uniform, yet frequency bias is still observed during NN training (Fridovich-Keil et al., 2021), even though the theory is absent. This paper proves that frequency bias is present for nonuniform training data by using a new viewpoint based on a data-dependent quadrature rule. We use this theory to propose new loss functions for NN training that accelerate its convergence and improve its stability with respect to noise.

An NN function is a map N : R^d → R given by

    N(x) = W_{N_L} σ(W_{N_L−1} σ(⋯ (W_2 σ(W_1 x + b_1) + b_2) + ⋯) + b_{N_L−1}) + b_{N_L},

where W_i ∈ R^{m_i × m_{i−1}} are weights, m_0 = d, b_i ∈ R^{m_i} are biases, and N_L is the number of layers.
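The layered map above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the layer widths and the choice of a ReLU activation (used throughout the paper) are assumptions of the example.

```python
import numpy as np

def relu(a):
    # Entry-wise activation: sigma(a)_j = max(a_j, 0)
    return np.maximum(a, 0.0)

def nn_forward(x, weights, biases):
    """Evaluate N(x) = W_{N_L} sigma(... W_2 sigma(W_1 x + b_1) + b_2 ...) + b_{N_L}."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)              # hidden layers apply the activation
    return weights[-1] @ a + biases[-1]  # final layer is affine (no activation)

# Example: d = 3 input normalized onto the unit sphere S^2, one hidden layer of width 5
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)                   # normalize so that x lies on S^{d-1}
weights = [rng.standard_normal((5, 3)), rng.standard_normal((1, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(1)]
y = nn_forward(x, weights, biases)       # scalar output N(x), shape (1,)
```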
Here, σ is the activation function applied entry-wise to a vector, i.e., σ(a)_j = σ(a_j). In this paper, we consider ReLU NNs, i.e., NNs for which σ is the ReLU function given by ReLU(t) = max(t, 0). Since ReLU(αt) = α ReLU(t) for any α > 0, we assume that the input of the NN is normalized so that x ∈ S^{d−1}. To introduce a continuous perspective, we assume that there is an underlying target function g : S^{d−1} → R and that the training samples x_i follow a distribution µ(x). Given training data {(x_i, y_i)}_{i=1}^n drawn from g, where x_i ∈ S^{d−1} and y_i ≈ g(x_i), our goal is to train the NN so that N uniformly approximates g on S^{d−1}, in a way that is robust when the sampling of y_i from g(x_i) is contaminated by noise. One standard training procedure is a gradient-based optimization algorithm that minimizes the residual in the squared L^2(dµ) norm, i.e.,

    Φ(W) = (A_d/2) ∫_{S^{d−1}} |g(x) − N(x; W)|^2 dµ(x) ≈ (A_d/(2n)) Σ_{i=1}^n |y_i − N(x_i; W)|^2,    (1)

where A_d is the Lebesgue measure of the hypersphere S^{d−1} and W represents the weights and bias terms. Similar to most theoretical studies investigating frequency bias, we restrict ourselves to 2-layer NNs (Arora et al., 2019; Basri et al., 2019; Su & Yang, 2019; Cao et al., 2019).

To study NN training, it is common to consider the dynamics of Φ(W) as one optimizes the coefficients in W. For example, the gradient flow of the NN weights is given by dW/dt = −∂Φ/∂W. Define the residual z(x; W) = g(x) − N(x; W). Applying gradient flow with the population loss gives

    dz(x; W)/dt = −A_d ∫_{S^{d−1}} K(x, x′; W) z(x′; W) dµ(x′),    (2)

where K(x, x′; W) = ⟨∂N(x; W)/∂W, ∂N(x′; W)/∂W⟩. Under the assumption that the weights do not change much during training, one can consider the NTK given the underlying time-independent distribution of W, i.e., K_∞(x, x′) = E_W[K(x, x′; W)] (Du et al., 2018). Based on eq. (2), one can understand the decay of the residual by studying the reproducing kernel Hilbert space (RKHS) through a spectral decomposition of the integral operator L defined by

    (Lz)(x) = ∫_{S^{d−1}} K_∞(x, x′) z(x′) dµ(x′).

Most results in the literature require µ(x) to be the uniform distribution over the sphere so that the eigenfunctions of L are spherical harmonics and the eigenvalues have explicit forms (Cao et al., 2019; Basri et al., 2019; Scetbon & Harchaoui, 2021). These explicit formulas for the eigenvalues and eigenfunctions of L rely on the Funk-Hecke theorem, which provides a formula allowing one to express an integral over a hypersphere as an integral over an interval (Seeley, 1966). The frequency bias of NN training can then be explained by the fact that low-degree spherical harmonic polynomials are eigenfunctions of L associated with large eigenvalues (Basri et al., 2019). Thus, for uniform training data, the optimization of the weights and biases of an NN tends to fit the low-frequency components of the residual first.

When µ(x) is nonuniform, it is difficult to analyze the spectral properties of L and thus the frequency bias properties of NN training. Since the Funk-Hecke formula no longer holds, there are only a few special cases where frequency bias is understood (Williams & Rasmussen, 2006, Sec. 4.3). Although one may derive asymptotic bounds for the eigenvalues (Widom, 1963; 1964; Bach & Jordan, 2002), it is hard to obtain formulas for the eigenfunctions, and one usually relies on numerical approximations (Baker, 1977). For the ReLU-based NTK, Basri et al. (2020) provided explicit eigenfunctions assuming that µ(x) is piecewise constant on S^1, but this analysis does not generalize to higher dimensions.
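In the absence of explicit formulas, the spectrum of L is typically estimated numerically, as noted above (Baker, 1977). The sketch below illustrates one standard Monte Carlo (Nyström-type) estimate on S^1 with clustered samples; the first-order arc-cosine kernel is used as a hypothetical stand-in for the full ReLU NTK K_∞, and the sampling density is an arbitrary choice for illustration.

```python
import numpy as np

def arccos_kernel(u):
    # First-order arc-cosine kernel of a ReLU unit on the sphere:
    # k(x, x') = (sin t + (pi - t) cos t) / pi, with t the angle between x and x'.
    # Used here only as a simple positive-definite stand-in for K_inf.
    t = np.arccos(np.clip(u, -1.0, 1.0))
    return (np.sin(t) + (np.pi - t) * np.cos(t)) / np.pi

# Monte Carlo estimate of the spectrum of (Lz)(x) = ∫ K(x,x') z(x') dmu(x'):
# with x_i ~ mu, the eigenvalues of the matrix K_ij / n approximate those of L.
rng = np.random.default_rng(1)
n = 400
theta = rng.beta(2.0, 5.0, size=n) * 2 * np.pi    # clustered (nonuniform) angles
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
K = arccos_kernel(X @ X.T)                        # symmetric kernel matrix
eigvals = np.sort(np.linalg.eigvalsh(K / n))[::-1]
# The estimated spectrum decays rapidly: a few large eigenvalues dominate,
# which is the matrix-level signature of frequency bias.
```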
To study the frequency bias of NN training, one needs to understand both the eigenvalues and the eigenfunctions of L, and this remains a significant challenge for a general µ(x) due to the absence of the Funk-Hecke formula. To overcome this challenge, we take a different point of view. While it is standard to discretize the integral in eq. (1) using a Monte Carlo-like average, we discretize it using a data-dependent quadrature rule whose nodes are the training data. That is, we investigate the frequency bias of NN training when minimizing the residual in the standard squared L^2 norm:

    Φ(W) = (1/2) ∫_{S^{d−1}} |g(x) − N(x; W)|^2 dx ≈ (1/2) Σ_{i=1}^n c_i |y_i − N(x_i; W)|^2,    (3)

where c_1, …, c_n are the quadrature weights associated with the (nonuniform) input data x_1, …, x_n. If x_1, …, x_n are drawn from a uniform distribution over the hypersphere, then one can select c_i = A_d/n for 1 ≤ i ≤ n; otherwise, one can choose any quadrature weights that make the integration rule accurate (see Section 4.2). If x_1, …, x_n are drawn independently at random from µ(x), then it is often reasonable to select c_i = 1/(n p(x_i)), where dµ(x) = p(x) dx. While c_1, …, c_n depend on x_1, …, x_n, the continuous expression for Φ(W) in eq. (3) is always unaltered. Therefore, we can use the Funk-Hecke formula to analyze the eigenvalues and eigenfunctions of the operator L̄ defined by

    (L̄z)(x) = ∫_{S^{d−1}} K_∞(x, x′) z(x′) dx′,

revealing the frequency bias. We emphasize that by choosing eq. (3) as a loss function instead of eq. (1), we are enforcing the frequency bias of NN training, whereas eq. (1) does not ensure such a spectral property (see Section 6.1 for an illustration). To further tune the frequency bias of an NN during training, we also propose to minimize the residual in a squared Sobolev H^s norm for a carefully selected s ∈ R. Unlike the L^2 norm (the case s = 0), the H^s norm for s ≠ 0 has its own frequency bias.
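The choice c_i = 1/(n p(x_i)) can be sketched concretely on S^1. This is a minimal illustration only: the clustered density p(θ) = (1 + 0.5 cos θ)/(2π) and the rejection sampler are hypothetical choices, not from the paper.

```python
import numpy as np

def sample_theta(rng, n):
    # Rejection sampling from the (assumed) nonuniform density
    # p(theta) = (1 + 0.5*cos(theta)) / (2*pi) on [0, 2*pi).
    samples = np.empty(0)
    while samples.size < n:
        t = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
        u = rng.uniform(0.0, 1.5, size=2 * n)
        samples = np.concatenate([samples, t[u < 1.0 + 0.5 * np.cos(t)]])
    return samples[:n]

rng = np.random.default_rng(0)
n = 2000
theta = sample_theta(rng, n)
p = (1.0 + 0.5 * np.cos(theta)) / (2.0 * np.pi)   # density evaluated at the samples

# Quadrature weights c_i = 1 / (n * p(x_i)): an importance-sampling rule with
# sum_i c_i f(x_i) ≈ ∫_{S^1} f(x) dx whenever x_i are drawn from p.
c = 1.0 / (n * p)

def weighted_loss(y, pred, c):
    # Discretized loss of eq. (3): Phi ≈ (1/2) sum_i c_i |y_i - N(x_i; W)|^2
    return 0.5 * np.sum(c * (y - pred) ** 2)
```

As a sanity check, the weights integrate the constant function 1 to approximately 2π, the circumference of S^1.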
For s > 0, the H^s norm penalizes high frequencies more than low frequencies, while for s < 0, low frequencies are penalized the most. We implement the squared H^s norm using a quadrature rule, which induces a different integral operator L_s. We
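On S^1, the frequency bias of the H^s norm is easy to see through Fourier coefficients: ‖z‖²_{H^s} = 2π Σ_k (1 + k²)^s |ẑ_k|². The sketch below uses this weight, which is one common convention and not necessarily the paper's exact definition, on an equispaced grid.

```python
import numpy as np

def hs_norm_sq(z, s):
    """Squared Sobolev H^s norm of grid samples z of a function on S^1,
    computed from Fourier coefficients; the weight (1 + k^2)^s is assumed.
    s = 0 recovers the squared L^2 norm."""
    n = len(z)
    zhat = np.fft.fft(z) / n                # Fourier coefficients z_k
    k = np.fft.fftfreq(n, d=1.0 / n)        # integer frequencies 0, 1, ..., -1
    return 2 * np.pi * np.sum((1 + k**2) ** s * np.abs(zhat) ** 2)

theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
low = np.cos(theta)       # frequency-1 residual
high = np.cos(8 * theta)  # frequency-8 residual
# Both have the same L^2 norm, but for s = 1 the high-frequency residual is
# penalized by (1 + 64)/(1 + 1) = 32.5 times more than the low-frequency one.
ratio = hs_norm_sq(high, 1.0) / hs_norm_sq(low, 1.0)
```

For s < 0 the ordering reverses, so the low-frequency residual dominates the loss, which is how the norm dampens or amplifies the intrinsic frequency bias.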

