TUNING FREQUENCY BIAS IN NEURAL NETWORK TRAINING WITH NONUNIFORM DATA

Abstract

Small generalization errors of over-parameterized neural networks (NNs) can be partially explained by the frequency bias phenomenon, where gradient-based algorithms minimize the low-frequency misfit before reducing the high-frequency residuals. Using the Neural Tangent Kernel (NTK), one can provide a theoretical analysis for training where data are drawn from constant or piecewise-constant probability densities. Since most training data sets are not drawn from such distributions, we use the NTK model and a data-dependent quadrature rule to theoretically quantify the frequency bias of NN training given nonuniform data. By replacing the loss function with a selected Sobolev norm, we can amplify or dampen the intrinsic frequency bias in NN training.

1. INTRODUCTION

Neural networks (NNs) in supervised learning are often trained on a small data set, yet they are observed to provide accurate predictions on a large number of test examples that are not seen during training. A mystery is how training an overparameterized NN can achieve small generalization errors and the so-called "double-descent" risk curve (Belkin et al., 2019). In recent years, a potential explanation has emerged called "frequency bias," which is the phenomenon that in the early epochs of training, an overparameterized NN finds a low-frequency fit of the training data, while higher frequencies are learned in later epochs (Rahaman et al., 2019; Yang & Salman, 2019; Xu, 2020). Beyond generalization errors, it is often useful to understand the convergence rate of each spectral component of the data mismatch in order to study the robustness of the NN under noise. Currently, frequency bias is theoretically understood via the Neural Tangent Kernel (NTK) (Jacot et al., 2018) for uniform training data (Arora et al., 2019; Cao et al., 2019; Basri et al., 2019) and for data distributed according to a piecewise-constant probability measure (Basri et al., 2020). However, most training data sets in practice are highly clustered and far from uniform; yet frequency bias is still observed during NN training (Fridovich-Keil et al., 2021), even though a theory is absent. This paper proves that frequency bias is present with nonuniform training data by using a new viewpoint based on a data-dependent quadrature rule. We use this theory to propose new loss functions for NN training that accelerate convergence and improve stability with respect to noise.

An NN function is a map $N : \mathbb{R}^d \to \mathbb{R}$ given by
$$N(x) = W_{N_L}\,\sigma\big(W_{N_L-1}\,\sigma(\cdots(W_2\,\sigma(W_1 x + b_1) + b_2) + \cdots) + b_{N_L-1}\big) + b_{N_L},$$
where $W_i \in \mathbb{R}^{m_i \times m_{i-1}}$ are weights, $m_0 = d$, $b_i \in \mathbb{R}^{m_i}$ are biases, and $N_L$ is the number of layers.
Here, $\sigma$ is the activation function applied entry-wise to a vector, i.e., $\sigma(a)_j = \sigma(a_j)$. In this paper, we consider ReLU NNs, which are NNs for which $\sigma$ is the ReLU function given by $\mathrm{ReLU}(t) = \max(t, 0)$. Since $\mathrm{ReLU}(\alpha t) = \alpha\,\mathrm{ReLU}(t)$ for any $\alpha > 0$, we assume that the input of the NN is normalized so that $x \in \mathbb{S}^{d-1}$. To introduce a continuous perspective, we assume that there is an underlying target function $g : \mathbb{S}^{d-1} \to \mathbb{R}$ and that the training samples $x_i$ follow the distribution $\mu(x)$. Given training data $\{(x_i, y_i)\}_{i=1}^n$ drawn from $g$, where $x_i \in \mathbb{S}^{d-1}$ and $y_i \approx g(x_i)$, our goal is to train the NN so that $N$ uniformly approximates $g$ on $\mathbb{S}^{d-1}$, in a way that is robust when the sampling of $y_i$ from $g(x_i)$ is contaminated by noise. One standard training procedure is a gradient-based optimization algorithm that minimizes the residual in the squared $L^2(d\mu)$ norm, i.e.,



$$\frac{A_d}{2} \int_{\mathbb{S}^{d-1}} \left| g(x) - N(x; W) \right|^2 \, d\mu(x) \;\approx\; \frac{A_d}{2n} \sum_{i=1}^{n} \left| y_i - N(x_i; W) \right|^2, \qquad (1)$$


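To make the setup concrete, the following sketch builds a small random ReLU network $N$, draws nonuniform samples on the unit circle $\mathbb{S}^1$, and evaluates the discretized loss on the right-hand side of (1). The target function, layer widths, sampling density, and the choice $A_d = |\mathbb{S}^1| = 2\pi$ are illustrative assumptions for this sketch, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(t):
    # ReLU(t) = max(t, 0), applied entry-wise; positively homogeneous.
    return np.maximum(t, 0.0)

def nn_forward(X, weights, biases):
    """Evaluate N(x) = W_{N_L} σ(... σ(W_1 x + b_1) ...) + b_{N_L}
    column-wise on X of shape (d, n)."""
    A = X
    for W, b in zip(weights[:-1], biases[:-1]):
        A = relu(W @ A + b[:, None])
    W, b = weights[-1], biases[-1]
    return (W @ A + b[:, None]).ravel()

# Small ReLU NN with layer widths m_0 = d, m_1, ..., m_{N_L} = 1.
d, widths = 2, [2, 32, 32, 1]
weights = [rng.standard_normal((m, mp)) for mp, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

# Nonuniform samples on S^1: angles clustered near θ = 0 (illustrative μ).
n = 5000
theta = rng.beta(2.0, 5.0, size=n) * 2.0 * np.pi
X = np.vstack([np.cos(theta), np.sin(theta)])   # x_i ∈ S^{d-1} with d = 2

g = lambda th: np.cos(3.0 * th)                 # hypothetical target g on S^1
y = g(theta)                                    # noiseless labels y_i = g(x_i)

# Discretized loss: (A_d / 2n) Σ |y_i − N(x_i; W)|², with A_d = |S^1| = 2π.
A_d = 2.0 * np.pi
residuals = y - nn_forward(X, weights, biases)
loss = (A_d / (2.0 * n)) * np.sum(residuals ** 2)
```

Since the angles are drawn from $\mu$, the sum is a Monte Carlo estimate of the weighted integral on the left of (1); a gradient-based training loop would differentiate `loss` with respect to the weights and biases.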