EFFECTIVE REGULARIZATION THROUGH LOSS-FUNCTION METALEARNING

Abstract

Loss-function metalearning can be used to discover novel, customized loss functions for deep neural networks, resulting in improved performance, faster training, and better data utilization. A likely explanation is that such functions discourage overfitting, leading to effective regularization. This paper demonstrates theoretically that this is indeed the case for the TaylorGLO method: decomposing the learning rule makes it possible to characterize the training dynamics and to show that the loss functions evolved by TaylorGLO balance a pull toward zero training error with a push away from it that avoids overfitting. This observation leads to an invariant that can be used to make the metalearning process more efficient in practice and to produce networks that are robust against adversarial attacks. Loss-function optimization can thus be seen as a well-founded new aspect of metalearning in neural networks.

1. INTRODUCTION

Regularization is a key concept in deep learning: it guides learning toward configurations that are likely to perform robustly on unseen data. Different regularization approaches originate from an intuitive understanding of the learning process and have been shown to be effective empirically. However, the underlying mechanisms, the different types of regularization, and their interactions are still poorly understood. Recently, loss-function optimization has emerged as a new area of metalearning and has shown great potential for training better models. Experiments suggest that metalearned loss functions serve as regularizers in a surprising but transparent way: they prevent the network from making overly confident predictions (e.g. Baikal loss; Gonzalez & Miikkulainen, 2020a). While it may be too early to develop a comprehensive theory of regularization, given the relatively nascent state of this area, it may be possible to make progress in understanding regularization of this specific type. That is the goal of this paper.

Since metalearned loss functions are customized to a given architecture-task pair, a shared framework is needed under which loss functions can be analyzed and compared. The TaylorGLO (Gonzalez & Miikkulainen, 2020b) technique for loss-function metalearning lends itself well to such analysis: it represents loss functions as multivariate Taylor polynomials and leverages evolution to optimize a fixed number of parameters in this representation. In this framework, the SGD learning rule is decomposed into coefficient expressions that can be defined for a wide range of loss functions. These expressions provide an intuitive understanding of the training dynamics in specific contexts.
Using this framework, mean squared error (MSE), cross-entropy, Baikal, and TaylorGLO loss functions are analyzed at the null epoch, when network weights are similarly distributed (Appendix C), and in the zero-training-error regime, where the training samples' labels have been memorized perfectly. For any intermediate point in the training process, the strength of the zero-training-error regime as an attractor is analyzed, and a constraint on TaylorGLO parameters is derived from this property by characterizing how the entropy of the output distribution changes. For a concrete metalearned TaylorGLO loss function, these attraction dynamics are calculated for individual samples at every epoch of a real training run and contrasted with those of the cross-entropy loss. This comparison clarifies how TaylorGLO avoids becoming overly confident in its predictions. Further, the analysis shows (in Appendix D.2) how label smoothing (Szegedy et al., 2016), a traditional type
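Label smoothing, referenced above, has a compact standard form: the one-hot target is mixed with the uniform distribution over the K classes, pulling targets away from exact 0/1 values and thereby discouraging overconfident outputs. A minimal sketch:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Label smoothing (Szegedy et al., 2016): replace a one-hot target
    with (1 - eps) * onehot + eps / K, i.e. a mixture of the hard target
    and the uniform distribution over the K classes."""
    k = onehot.shape[-1]
    return (1.0 - eps) * onehot + eps / k

y = np.array([[1.0, 0.0, 0.0]])
print(smooth_labels(y, eps=0.1))  # ≈ [[0.9333, 0.0333, 0.0333]]
```

Because the smoothed target never reaches 1, the gradient of cross-entropy no longer vanishes only at a fully confident prediction, which is the same qualitative effect the analysis attributes to the metalearned losses.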

