ADAPTIVE OPTIMIZATION IN THE ∞-WIDTH LIMIT

Abstract

Recent works have developed a detailed understanding of the behavior of large neural networks via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning (µ) limits. These theories were developed for stochastic gradient descent (SGD). Yet, in practice, all large neural networks are trained using Adam or other adaptive gradient optimizers (AGOs), which are not covered by such previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and µ parametrizations as well as their infinite-width limits. We find that 1) the NTK limit of AGOs, in contrast to that of SGD, depends nonlinearly on the loss derivative but nevertheless still fails to learn features; and 2) this is fixed by the µ limit of AGOs (as in the case of SGD). To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.

1. INTRODUCTION

Infinite-width limits of neural networks have been a major focus of study in the last several years, underlying some of the most profound recent breakthroughs in our theoretical understanding of deep learning. Specifically, two types of limits have garnered the lion's share of attention from the research community. The kernel limit, popularized by the seminal work of Jacot et al. (2018), refers to a regime of training where the weights remain close to their initial values, and training may be entirely characterized in function space by a constant kernel of a particular form, which depends on the network architecture. While easier to analyze, this limit does not permit updates to the internal representation of the network; hence it cannot account for data-dependent feature learning, a staple of deep learning in practice. In contrast, the µ limit (of which the well-known mean field limit is a specific case for 1-hidden-layer perceptrons) refers to a regime of training where the weights adapt to the data in a nonlinear fashion, facilitating representation learning. It was recently shown in Yang & Hu (2020) that, under vanilla gradient-based training, the precise setting of various hyperparameters relating to initialization scale and learning rate determines the type of infinite-width limit one can associate with a trained neural network. Notably, the µ parametrization was identified as the unique parametrization that gives rise to "maximal" feature learning dynamics in the infinite-width limit, where maximal refers to the fact that every layer learns features. Remarkably, however, no such limits have yet been formally established for adaptive gradient-based optimization of neural networks, which is the focus of the present paper. Our main results are the identification and prescription of two types of infinite-width limits for popular adaptive gradient optimizers (AGOs), the counterparts of the kernel and feature learning limits for vanilla gradient descent.
For the kernel limit counterpart, we uncover fundamentally different dynamics for adaptive optimization, which we refer to as the adaptive neural tangent kernel (ANTK) regime. In this limit, the training dynamics can no longer be described by kernel gradient descent, since the kernel function itself depends nonlinearly on the loss derivative. Our results lay a clear path toward theoretically analyzing the implicit biases of AGOs in the infinite-width limit.
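To build intuition for this nonlinear dependence, the following toy sketch compares how one update step changes the output of a linear model f(x) = w·x under plain SGD and under a memoryless Adam-like rule. It is only an illustration under stated assumptions, not the ANTK construction itself: the linear model, the data vectors, the learning rate, and the helper names (sgd_update, adam_like_update, function_change) are all hypothetical, and the Adam-like rule corresponds to the first bias-corrected Adam step, where the update reduces to g / (|g| + eps).

```python
import numpy as np

def sgd_update(grad, lr=0.1):
    # Plain gradient step: linear in the gradient.
    return -lr * grad

def adam_like_update(grad, lr=0.1, eps=1e-8):
    # First bias-corrected Adam step: m_hat = g and v_hat = g**2,
    # so the update is -lr * g / (|g| + eps), a sign-like function of g.
    return -lr * grad / (np.abs(grad) + eps)

def function_change(update_fn, chi, x_train, x_test):
    # For f(x) = w . x, the weight gradient is chi * x_train,
    # where chi is the loss derivative dL/df at the training point.
    grad = chi * x_train
    # Change in the output at x_test induced by the weight update.
    return float(update_fn(grad) @ x_test)

# Hypothetical data, chosen only for illustration.
x_train = np.array([1.0, -2.0, 0.5])
x_test = np.array([0.3, 0.1, -1.0])

# Scale the loss derivative by 10 and compare the induced function change.
sgd_small = function_change(sgd_update, 1.0, x_train, x_test)
sgd_large = function_change(sgd_update, 10.0, x_train, x_test)
adam_small = function_change(adam_like_update, 1.0, x_train, x_test)
adam_large = function_change(adam_like_update, 10.0, x_train, x_test)

print("SGD ratio:      ", sgd_large / sgd_small)
print("Adam-like ratio:", adam_large / adam_small)
```

Under SGD the function change is -lr · chi · (x_train · x_test), so it scales linearly with the loss derivative and the kernel x_train · x_test does not depend on chi. Under the Adam-like rule the change is almost insensitive to the magnitude of chi, so no chi-independent kernel can describe it.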

