EFFECTIVE REGULARIZATION THROUGH LOSS-FUNCTION METALEARNING

Abstract

Loss-function metalearning can be used to discover novel, customized loss functions for deep neural networks, resulting in improved performance, faster training, and better data utilization. A likely explanation is that such functions discourage overfitting, i.e. provide effective regularization. This paper demonstrates theoretically that this is indeed the case for the TaylorGLO method: decomposing the learning rules makes it possible to characterize the training dynamics and to show that loss functions evolved by TaylorGLO balance the pull toward zero training error against a push away from it that prevents overfitting. This observation leads to an invariant that can be exploited to make the metalearning process more efficient in practice and to produce networks that are robust against adversarial attacks. Loss-function optimization can thus be seen as a well-founded new aspect of metalearning in neural networks.

1. INTRODUCTION

Regularization is a key concept in deep learning: it guides learning towards configurations that are likely to perform robustly on unseen data. Different regularization approaches originate from an intuitive understanding of the learning process and have been shown to be effective empirically. However, the understanding of the underlying mechanisms, of the different types of regularization, and of their interactions is limited. Recently, loss-function optimization has emerged as a new area of metalearning and shown great potential in training better models. Experiments suggest that metalearned loss functions serve as regularizers in a surprising but transparent way: they prevent the network from learning overly confident predictions (e.g. Baikal loss; Gonzalez & Miikkulainen, 2020a). While it may be too early to develop a comprehensive theory of regularization, given the relatively nascent state of this area, it may be possible to make progress in understanding regularization of this specific type. That is the goal of this paper. Since metalearned loss functions are customized to a given architecture-task pair, a shared framework is needed under which loss functions can be analyzed and compared. The TaylorGLO (Gonzalez & Miikkulainen, 2020b) technique for loss-function metalearning lends itself well to such analysis: it represents loss functions as multivariate Taylor polynomials and leverages evolution to optimize a fixed number of parameters in this representation. In this framework, the SGD learning rule is decomposed into coefficient expressions that can be defined for a wide range of loss functions. These expressions provide an intuitive understanding of the training dynamics in specific contexts.
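To make the representation concrete, the following sketch shows a per-sample loss parameterized as a third-order bivariate Taylor polynomial in the label and the network output. The expansion point and the layout of the coefficients here are illustrative choices, not the exact TaylorGLO parameterization:

```python
import numpy as np

def taylor_loss(y_true, y_pred, theta):
    """Per-sample loss as a third-order bivariate Taylor polynomial.

    theta[0:2] give an expansion point (a, b); theta[2:] are polynomial
    coefficients. This layout is a plausible sketch of a Taylor-
    parameterized loss, not the exact TaylorGLO form.
    """
    a, b = theta[0], theta[1]
    u = y_true - a   # deviation of the label from the expansion point
    v = y_pred - b   # deviation of the prediction from the expansion point
    c = theta[2:]
    # All monomials in (u, v) with total degree <= 3 that involve v,
    # i.e., terms that contribute a gradient w.r.t. the prediction.
    per_class = (c[0] * v
                 + c[1] * v**2
                 + c[2] * v**3
                 + c[3] * u * v
                 + c[4] * u * v**2
                 + c[5] * u**2 * v)
    return -np.mean(per_class)
```

Evolution then searches directly over the fixed-length vector `theta`, which is what makes loss functions comparable under one framework.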
Using this framework, mean squared error (MSE), cross-entropy, Baikal, and TaylorGLO loss functions are analyzed at the null epoch, when network weights are similarly distributed (Appendix C), and in the zero-training-error regime, where the training samples' labels have been perfectly memorized. For any intermediate point in the training process, the strength of the zero-training-error regime as an attractor is analyzed, and a constraint on TaylorGLO parameters is derived from this property by characterizing how the entropy of the output distribution changes. For a concrete metalearned TaylorGLO loss function, these attraction dynamics are calculated for individual samples at every epoch of a real training run and contrasted with those of the cross-entropy loss. This comparison clarifies how TaylorGLO avoids becoming overly confident in its predictions. Further, the analysis shows (in Appendix D.2) how label smoothing (Szegedy et al., 2016), a traditional type of regularization, can be encoded implicitly by TaylorGLO loss functions: any representable loss function has label-smoothed variants that are also representable by the parameterization. From these analyses, practical opportunities arise. First, at the null epoch, where the desired behavior can be characterized clearly, an invariant can be derived on a TaylorGLO loss function's parameters that must hold for networks to be trainable. This constraint is then applied within the TaylorGLO algorithm to guide the search towards good loss functions more efficiently. Second, loss-function-based regularization results in robustness that should, for example, make networks more resilient to adversarial attacks. This property is demonstrated experimentally by incorporating adversarial robustness as an objective within the TaylorGLO search process. Thus, loss-function metalearning can be seen as a well-founded and practical approach to effective regularization in deep learning.
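A null-epoch invariant of this kind makes a cheap candidate filter during search. The sketch below illustrates the idea with a hypothetical sign check rather than the exact invariant derived in the paper: at near-uniform outputs, minimizing a usable loss should raise the target class's output and lower the others. The function name and finite-difference construction are illustrative:

```python
import numpy as np

def trainable_at_null_epoch(loss_fn, theta, n_classes=10, eps=1e-4):
    """Hypothetical trainability filter (not the exact TaylorGLO invariant).

    At the null epoch the network's outputs are roughly uniform across
    classes; a usable loss should then pull the target class's output up
    and push the non-target outputs down.
    """
    y_pred = np.full(n_classes, 1.0 / n_classes)  # near-uniform outputs
    y_true = np.zeros(n_classes)
    y_true[0] = 1.0                               # class 0 is the target

    def num_grad(i):
        # One-sided finite-difference gradient of the loss w.r.t. output i.
        bumped = y_pred.copy()
        bumped[i] += eps
        return (loss_fn(y_true, bumped, theta)
                - loss_fn(y_true, y_pred, theta)) / eps

    # Minimizing the loss should increase the target output (negative
    # gradient) and decrease a non-target output (positive gradient).
    return num_grad(0) < 0 and num_grad(1) > 0

def mse(y_true, y_pred, theta=None):
    # Mean squared error: a simple loss that passes this filter.
    return np.sum((y_pred - y_true) ** 2)
```

Candidates failing such a check can be discarded before any network training, which is how the constraint speeds up the search.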

2. BACKGROUND

Regularization traditionally refers to methods for encouraging smoother mappings by adding a regularizing term to the objective function, i.e., to the loss function in neural networks. It can be defined more broadly, however, e.g. as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error" (Goodfellow et al., 2015). To that end, many regularization techniques have been developed that aim to improve the training process in neural networks. These techniques can be architectural in nature, such as Dropout (Srivastava et al., 2014) and Batch Normalization (Ioffe & Szegedy, 2015), or they can alter some aspect of the training process, such as label smoothing (Szegedy et al., 2016) or the minimization of a weight norm (Hanson & Pratt, 1989). These techniques are briefly reviewed in this section, providing context for loss-function metalearning.
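For reference, label smoothing (Szegedy et al., 2016) replaces the one-hot target with a mixture of the one-hot vector and the uniform distribution over classes; a minimal sketch:

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Label smoothing: mix the one-hot target with the uniform
    distribution over the k classes, weighted by alpha."""
    k = y_onehot.shape[-1]
    return (1.0 - alpha) * y_onehot + alpha / k
```

With `alpha = 0.1` and three classes, the target class receives probability 0.9 + 0.1/3 and each other class 0.1/3, so the targets still sum to one but are never fully confident.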

2.1. IMPLICIT BIASES IN OPTIMIZERS

It may seem surprising that overparameterized neural networks are able to generalize at all, given that they have the capacity to memorize a training set perfectly, and in fact sometimes do (i.e., zero training error is reached). Different optimizers have different implicit biases that determine which solutions are ultimately found. These biases provide implicit regularization to the optimization process (Neyshabur et al., 2015). Such implicit regularization results from a network norm (a measure of complexity) being minimized as optimization progresses. This is why models continue to improve even after the training set has been memorized, i.e., after a global optimum of the training error has been reached (Neyshabur et al., 2017). For example, the process of stochastic gradient descent (SGD) itself has been found to provide regularization implicitly when learning on data with noisy labels (Blanc et al., 2020). In overparameterized networks, adaptive optimizers find very different solutions than basic SGD. These solutions tend to have worse generalization properties, even though they tend to have lower training errors (Wilson et al., 2017).

2.2. REGULARIZATION APPROACHES

While optimizers may minimize a network norm implicitly, regularization approaches supplement this process and make it explicit. For example, a common way to restrict the parameter norm explicitly is through weight decay, which discourages network complexity by placing a cost on weights (Hanson & Pratt, 1989). Generalization and regularization are often characterized at the end of training, i.e., as a behavior that results from the optimization process. Various findings have influenced work in regularization. For example, flat loss landscapes have better generalization properties (Keskar et al., 2017; Li et al., 2018; Chaudhari et al., 2019). In overparameterized cases, the solutions at the center of these landscapes may have zero training error (i.e., perfect memorization), and under certain conditions, zero training error empirically leads to lower generalization error (Belkin et al., 2019; Nakkiran et al., 2019). However, when the training loss itself reaches zero, generalization suffers (Ishida et al., 2020). This behavior can be thought of as overtraining, and techniques have been developed to reduce it at the end of the training process, such as early stopping (Morgan & Bourlard, 1990) and flooding (Ishida et al., 2020).
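Flooding is simple to state: the training loss is prevented from descending below a flood level b, because once the loss drops under b the gradient's sign flips and the optimizer ascends back toward b instead of memorizing further. A minimal sketch (the flood level 0.05 is an arbitrary illustrative value):

```python
def flooded_loss(loss, b=0.05):
    """Flooding (Ishida et al., 2020): |loss - b| + b.

    Above the flood level b the loss and its gradient are unchanged;
    below b the gradient sign flips, pushing the loss back up to b.
    """
    return abs(loss - b) + b
```

For example, a batch loss of 0.20 passes through unchanged, while a batch loss of 0.01 is mapped to 0.09, so gradient descent on the flooded loss moves away from zero training loss.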

