ON THE NEURAL TANGENT KERNEL OF EQUILIBRIUM MODELS

Abstract

Existing analyses of the neural tangent kernel (NTK) for infinite-depth networks show that the kernel typically becomes degenerate as the number of layers grows. This raises the question of how to apply such methods to practical "infinite depth" architectures such as the recently proposed deep equilibrium (DEQ) model, which directly computes the infinite-depth limit of a weight-tied network via root-finding. In this work, we show that because of the input injection component of these networks, DEQ models have non-degenerate NTKs even in the infinite-depth limit. Furthermore, we show that these kernels can themselves be computed by a root-finding problem analogous to the one in traditional DEQs, and highlight methods for computing the NTK for both fully-connected and convolutional variants. We evaluate these models empirically, showing they match or improve upon the performance of existing regularized NTK methods.

1. INTRODUCTION

Recent work empirically observes that as the depth of a weight-tied, input-injected network increases, its output tends to converge to a fixed point. Motivated by this phenomenon, DEQ models were proposed to represent an "infinite depth" network directly via root-finding. A natural question to ask is: what do DEQs become if their width also goes to infinity? It is well known that under certain random initializations, neural networks of various architectures converge to Gaussian processes as their widths go to infinity (Neal, 1996; Lee et al., 2017; Yang, 2019; Matthews et al., 2018; Novak et al., 2018; Garriga-Alonso et al., 2018). Recent advances in deep learning theory have also shown that in the infinite-width limit, with proper initialization (the NTK initialization), training the network f_θ with gradient descent is equivalent to solving kernel regression with respect to the neural tangent kernel (Arora et al., 2019; Jacot et al., 2018; Yang, 2019; Huang et al., 2020). However, as the depth goes to infinity, Jacot et al. (2019) showed that the NTKs of fully-connected neural networks (FCNNs) converge either to a constant (freeze) or to the Kronecker delta (chaos). In this work, we show that with input injection, the DEQ-NTKs converge to meaningful fixed points that depend on the input in a non-trivial way, thus avoiding both freeze and chaos. Furthermore, analogous to DEQ models themselves, we can compute these kernels by solving a fixed-point equation, rather than iteratively applying the updates associated with the traditional NTK. Such derivations also carry over to other architectures, such as convolutional DEQs (CDEQs). We evaluate the approach and demonstrate that it typically matches or improves upon the performance of existing regularized NTK methods.
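As a concrete illustration of the forward fixed point that DEQs compute, the following sketch iterates a weight-tied, input-injected layer z_{t+1} = tanh(W z_t + U x) to convergence. The choice of layer, the shapes, and the weight scaling are illustrative assumptions, not the exact parameterization used in the paper; a real DEQ would apply a quasi-Newton root finder to g(z) = f(z) - z instead of naive forward iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden width (illustrative)

# Weight-tied layer with input injection: z_{t+1} = tanh(W z_t + U x + b).
# W is scaled down so the iteration is a contraction and a fixed point exists.
W = rng.standard_normal((d, d)) * (0.3 / np.sqrt(d))
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
x = rng.standard_normal(d)

def f(z):
    return np.tanh(W @ z + U @ x + b)

# Naive forward iteration to the fixed point z* = f(z*).
z = np.zeros(d)
for _ in range(200):
    z_next = f(z)
    if np.linalg.norm(z_next - z) < 1e-10:
        z = z_next
        break
    z = z_next

assert np.allclose(z, f(z), atol=1e-8)  # z is (numerically) a fixed point
```

Note that the input injection term U x is what keeps the fixed point dependent on x; without it, the weight-tied iteration would converge to the same point for every input, mirroring the kernel degeneracy discussed above.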

2. BACKGROUND AND PRELIMINARIES

Bai et al. (2019) proposed the DEQ model, which is equivalent to running an infinite-depth network with tied weights and input injection. These methods trace back to some of the original work on recurrent backpropagation (Almeida, 1990; Pineda, 1988), but with specific emphasis on: 1) computing the fixed point directly via root-finding rather than forward iteration; and 2) incorporating elements of modern deep networks into the single "layer", such as self-attention transformers (Bai et al., 2019), multi-scale convolutions (Bai et al., 2020), etc. The DEQ algorithm finds the infinite-depth fixed point using quasi-Newton root-finding methods, and then backpropagates using implicit differentiation without storing the derivatives of the intermediate layers, thus achieving a constant memory cost regardless of the effective depth.
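The constant-memory backward pass can be sketched via the implicit function theorem: if z* = f(z*; W), then the gradient of a scalar loss l(z*) is obtained by solving a single linear system involving (I - J^T), where J = ∂f/∂z evaluated at z*, rather than by storing intermediate iterates. The toy layer f(z) = tanh(Wz + Ux), the loss l(z) = v·z, and the explicitly formed Jacobian below are illustrative assumptions for a width small enough to build the system directly; they are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5  # small width so the Jacobian can be formed explicitly (illustrative)

W = rng.standard_normal((d, d)) * (0.3 / np.sqrt(d))
U = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)
v = rng.standard_normal(d)  # defines a scalar loss l(z) = v . z

def solve_fixed_point(W):
    z = np.zeros(d)
    for _ in range(500):
        z = np.tanh(W @ z + U @ x)
    return z

z_star = solve_fixed_point(W)

# Implicit differentiation: with f(z) = tanh(W z + U x) and z* = f(z*),
#   dl/dW_{ij} = u_i * s_i * z*_j,  where (I - J^T) u = dl/dz* = v
# and J = df/dz at z*, s = tanh'(pre-activation).
pre = W @ z_star + U @ x
s = 1.0 - np.tanh(pre) ** 2          # elementwise tanh'
J = s[:, None] * W                   # Jacobian df/dz at z*
u = np.linalg.solve(np.eye(d) - J.T, v)
grad_W = np.outer(u * s, z_star)     # dl/dW via the implicit function theorem

# Check one entry against finite differences through the full solve.
i, j = 2, 3
eps = 1e-6
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
fd = (v @ solve_fixed_point(Wp) - v @ solve_fixed_point(Wm)) / (2 * eps)
assert abs(grad_W[i, j] - fd) < 1e-5
```

Because the gradient only requires z* and one linear solve, memory usage does not grow with the number of forward iterations, in contrast to unrolled backpropagation through every layer.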

