ADAPTIVE OPTIMIZATION IN THE ∞-WIDTH LIMIT

Abstract

Recent works have developed a detailed understanding of large neural networks' behaviors via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning (µ) limits. These theories were developed for stochastic gradient descent. Yet, in practice, all large NNs are trained using Adam or other adaptive gradient optimizers (AGOs), which are not covered by these previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and µ parametrizations as well as their infinite-width limits. We find that 1) the NTK limit of AGOs, in contrast to that of SGD, depends nonlinearly on the loss derivative but nevertheless still fails to learn features; 2) this is fixed by the µ limit of AGOs (as in the case of SGD). To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.

1. INTRODUCTION

Infinite-width limits of neural networks have been a major focus of study in recent years, underlying some of the most profound breakthroughs in our theoretical understanding of deep learning. Two types of limits have garnered the lion's share of attention from the research community. The kernel limit, popularized by the seminal work of Jacot et al. (2018), refers to a regime of training where weights remain roughly at their initial values, and training may be entirely characterized in function space by a constant kernel of a particular form, which depends on the network architecture. While easier to analyze, this limit does not permit updates to the internal representation of the network, hence it cannot account for data-dependent feature learning, a staple of deep learning in practice. In contrast, the µ limit (of which the well-known mean field limit is a special case for 1-hidden-layer perceptrons) refers to a regime of training where the weights adapt to the data in a nonlinear fashion, facilitating representation learning. It was recently shown in Yang & Hu (2020) that, under vanilla gradient-based training, the precise setting of various hyperparameters relating to initialization scale and learning rate determines the type of infinite-width limit one can associate with a trained neural network. Notably, the µ parameterization was identified as the unique parameterization that gives rise to "maximal" feature learning dynamics in the infinite-width limit, where maximal refers to the fact that every layer learns features. Remarkably, however, no such limits have yet been formally established for adaptive gradient-based optimization of neural networks, which is the focus of the present paper. Our main results are the identification and prescription of two types of infinite-width limits for popular AGOs, the counterparts of the kernel and feature learning limits for vanilla GD.
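To make the kernel limit concrete, the following is a minimal NumPy sketch (function and variable names are ours, not from the paper) of the empirical tangent kernel of a 1-hidden-layer ReLU network in the NTK parameterization, i.e., the inner product of parameter gradients at two inputs. In the kernel regime, training is characterized in function space by this kernel, which stays roughly constant at its initialization value as width grows.

```python
import numpy as np

def ntk_entry(x1, x2, W, v, n):
    """Empirical NTK entry <grad_theta f(x1), grad_theta f(x2)> for
    f(x) = v @ relu(W @ x) / sqrt(n)   (NTK parameterization)."""
    def grads(x):
        h = W @ x                                    # preactivations, shape (n,)
        a = np.maximum(h, 0.0)                       # ReLU activations
        dv = a / np.sqrt(n)                          # gradient w.r.t. readout v
        dW = np.outer(v * (h > 0), x) / np.sqrt(n)   # gradient w.r.t. W
        return dv, dW

    dv1, dW1 = grads(x1)
    dv2, dW2 = grads(x2)
    return float(dv1 @ dv2 + np.sum(dW1 * dW2))

rng = np.random.default_rng(0)
d, n = 3, 1024                       # input dim, width
W = rng.standard_normal((n, d))      # O(1) entries; the 1/sqrt(n) lives in f
v = rng.standard_normal(n)
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

K12 = ntk_entry(x1, x2, W, v, n)
K21 = ntk_entry(x2, x1, W, v, n)
K11 = ntk_entry(x1, x1, W, v, n)
print(K12, K21, K11)                 # symmetric; diagonal entry is positive
```

As width n grows, such empirical kernel entries concentrate around their deterministic infinite-width values; this is the quantity that governs function-space dynamics in the kernel regime.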
For the kernel limit counterpart, we uncover fundamentally different dynamics for adaptive optimization, referred to as the adaptive neural tangent kernel (ANTK) regime. In this limit, the training dynamics can no longer be described by kernel gradient descent, since the kernel function itself depends nonlinearly on the loss derivative. Our results lay a clear path to theoretically analyzing the implicit biases of AGOs in the infinite-width limit.
Our analysis builds on the Tensor Programs (TP) framework, which computes infinite-width limits in two steps: 1) write down the relevant neural network computation (e.g., the first forward pass in the NNGP case) as a principled composition of matrix multiplications and coordinatewise nonlinearities, called a Tensor Program, and 2) recursively calculate the distribution of the coordinates of each vector via what is called the Master Theorem. Flexible as it is, the TP "language" is not expressive enough to represent the computations involved in adaptive optimization, since it does not support the application of nonlinear functions to higher-order tensors. In the present paper, we solve this issue by extending the TP framework with additional functionalities and proving a new Master Theorem that enables our analysis. While we present a simple application of our new framework to MLPs in Theorem 4.1 and Theorem 4.2, it is applicable in a much wider setting, including most practical architectures and algorithms. As an additional technical contribution, we prove an O(n^{-1/2}) convergence rate guarantee (where n denotes the width) for all variables produced by the program, which might be of independent interest.
Our Contributions: This paper presents the following major contributions:
1. We present the first rigorous infinite-width analysis of adaptive optimization of MLPs parameterized using the ANTK and µ parameterizations. Our results rigorously equate the training of such networks to discrete-time dynamical equations.
2. We develop a new Tensor Programs framework along with convergence rate guarantees, unlocking the infinite-width analysis of adaptive optimization in an architecturally universal sense.
Paper Organization: We survey related work in Section 2. In Section 3 we set up the preliminaries and notation used extensively in Section 4. In Section 4 we derive the ANTK and µ limits for MLPs. Section 5 is dedicated to a formal introduction of the new TP framework. Although it is the main tool used to prove our results in Section 4, Section 5 is more general and can be read as a standalone.
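The "gradient processing done by AGOs" that the TP language must express can be illustrated with a single Adam step (Kingma & Ba, 2015); the sketch below is ours, not from the paper. Unlike SGD's update -lr * g, which is linear in the gradient, the Adam step applies an entrywise nonlinear function to the gradient history (its first and second moment averages), which is why nonlinear functions of higher-order gradient statistics must enter the program.

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. The returned step is an entrywise *nonlinear*
    function of the gradient history, unlike SGD's linear -lr * g."""
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    step = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return step, m, v

g = np.array([1e-4, 1.0, 1e4])           # gradient entries spanning 8 orders of magnitude
step, m, v = adam_step(g, np.zeros(3), np.zeros(3), t=1)
print(step)   # every coordinate has magnitude ~ lr: a near sign-like update
```

At the first step this reduces to roughly -lr * sign(g): the update magnitude is insensitive to the gradient scale. It is exactly this normalization that breaks the linear dependence on the loss derivative underlying standard NTK dynamics.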

2. RELATED WORK

A large body of literature exists on both the kernel (NTK) limit Arora et al. (2019); Jacot et al. (2018); Lee et al. (2019); Yang (2020c); Yang & Littwin (2021) and the mean field limit for 2-layer neural networks Chizat & Bach (2018); Mei et al. (2018b); Nguyen & Pham (2020); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2020). Various papers describe the kernel and feature learning regimes more generally without taking an infinite-width limit. Chizat et al. (2019) describes the "lazy training" regime in arbitrary differentiable programs, controlled by a single parameter α that scales the output. It is shown that when α is large, the weights need only move slightly to fit the training data, and the network essentially performs kernel learning. Many papers Allen-Zhu et al. (2019); Huang & Yau (2020); Mei et al. (2018a) view the kernel and feature learning regimes as learning on different timescales, explicitly incorporating the time dependence in the infinite-width limit, and others derive finite-width corrections to the NTK Hanin & Nica (2020); Littwin et al. (2020a). In this paper, we consider training time to be constant and take only the width to infinity. This way, kernel and feature learning behavior are separated by the parameterization employed at initialization, not by width or training time.

TPs, first introduced in Yang (2019) and expanded upon in Yang (2020a;b), were developed as a theoretical framework to analyze the infinite-width limits of any architecture expressible in the TP language, in an attempt to replace the per-architecture analyses prevalent in the literature Alemohammad et al. (2021); Du et al. (2019); Hron et al. (2020); Littwin et al. (2020b). Yang & Hu (2020) defined a natural space of neural network parametrizations (abc-parametrizations) and classified all resulting infinite-width limits into two possible categories: 1) the kernel limit, in which weights and activations remain roughly in their initialization state, and 2) the feature learning limit, in which weights move substantially and adapt to the data. The µ parameterization was then identified as the "optimal" parameterization for arbitrary architectures in which all layers learn features, and was later heuristically extended to AGOs Yang et al. (2021).

Separately, AGOs Duchi et al. (2010); Kingma & Ba (2015); Zhou et al. (2018) and their variants were developed to accelerate learning by adapting the learning rate on a per-parameter basis, and currently serve as a prerequisite for training large-scale transformer models Huang et al. (2020); Liu et al. (2020); Zhang et al. (2019). Crucially, no previous work has yet developed a theory for infinite-width neural networks trained with AGOs.
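The per-parameter learning-rate adaptation in this family of optimizers can be illustrated with AdaGrad (Duchi et al., 2010); the sketch below is ours, not from any of the cited works. Each coordinate accumulates its own squared-gradient history, so coordinates that consistently receive large gradients are automatically assigned smaller effective learning rates.

```python
import numpy as np

def adagrad_step(g, G, lr=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients per coordinate,
    then divide each coordinate's step by sqrt(accumulator)."""
    G = G + g * g
    step = -lr * g / (np.sqrt(G) + eps)
    return step, G

G = np.zeros(2)
g = np.array([1.0, 0.01])        # coordinate 0 sees gradients 100x larger
for _ in range(3):
    step, G = adagrad_step(g, G)

lr_eff = 0.1 / (np.sqrt(G) + 1e-8)   # effective per-coordinate learning rate
print(lr_eff)   # the small-gradient coordinate keeps a much larger effective lr
```

Adam and its variants refine this idea with exponential moving averages instead of an unbounded accumulator, but the defining feature is the same: the effective learning rate is an entrywise function of each parameter's own gradient statistics.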

