ITERATIVE CONVERGENT COMPUTATION IS NOT A USEFUL INDUCTIVE BIAS FOR RESNETS

Abstract

Recent work has suggested that feedforward residual neural networks (ResNets) approximate iterative recurrent computations. Iterative computations are useful in many domains, so they might provide good solutions for neural networks to learn. Here we quantify the degree to which ResNets learn iterative solutions and introduce a regularization approach that encourages learning of iterative solutions. Iterative methods are characterized by two properties: iteration and convergence. To quantify these properties, we define three indices of iterative convergence. Consistent with previous work, we show that, even though ResNets can express iterative solutions, they do not learn them when trained conventionally on computer vision tasks. We then introduce regularizations to encourage iterative convergent computation and test whether this provides a useful inductive bias. To make the networks more iterative, we manipulate the degree of weight sharing across layers using soft gradient coupling. This new method provides a form of recurrence regularization and can interpolate smoothly between an ordinary ResNet and a "recurrent" ResNet (i.e., one that uses identical weights across layers and thus could be physically implemented with a recurrent network computing the successive stages iteratively across time). To make the networks more convergent, we impose a Lipschitz constraint on the residual functions using spectral normalization. The three indices of iterative convergence reveal that the gradient coupling and the Lipschitz constraint succeed at making the networks iterative and convergent, respectively. However, neither recurrence regularization nor spectral normalization improves classification accuracy on standard visual recognition tasks (MNIST, CIFAR-10, CIFAR-100) or on challenging recognition tasks with partial occlusions (Digitclutter). Iterative convergent computation, in these tasks, does not provide a useful inductive bias for ResNets.

1. INTRODUCTION

An iterative method solves a difficult estimation or optimization problem by starting from an initial guess and repeatedly applying a transformation that is known to improve the estimate, leading to a sequence of estimates that converges to the solution. Iterative methods provide a powerful approach to finding exact or approximate solutions where direct methods fail (e.g., for difficult inverse problems or solutions to systems of equations that are nonlinear and/or large). Recurrent neural networks (RNNs) iteratively apply the same transformation to their internal representation, suggesting that they may learn algorithms similar to the iterative methods used in mathematics and engineering. The idea of iterative refinement of a representation has also driven recent progress in the context of feedforward networks. New architectures, based on the idea of iterative refinement, have allowed for the training of very deep feedforward models with hundreds of layers. Prominent architectures for achieving high depth are residual (ResNets; He et al., 2016a) and highway networks (Srivastava et al., 2015), which use skip connections to drive the network to learn the residual: a pattern of adjustments to the input, thus encouraging the model to learn successive refinements of a representation of the input that is shared across layers. These architectures combine two ideas. The first is to use skip connections to alleviate the problem of vanishing or exploding gradients (Hochreiter, 1991). The second is to make these skip connections fixed identity connections, such that the layers learn successive refinement of a shared representational format. The second idea relates residual and highway networks to RNNs and iterative methods. Learning a single transformation that can be iteratively applied is attractive because it enables trading speed for accuracy by iterating longer (Spoerer et al., 2019).
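As a concrete illustration of an iterative method in the sense defined above (an initial guess, a fixed improving transformation, and convergence to the solution), consider Heron's fixed-point iteration for square roots. This toy example is not from the paper; it merely makes the definition operational:

```python
def heron_sqrt(a, x0=1.0, n_iters=20):
    """Estimate sqrt(a) by repeatedly applying the same update
    x -> (x + a/x) / 2, starting from the initial guess x0.

    Each application of the update is one iteration; the sequence of
    estimates converges (quadratically) to sqrt(a) for a > 0, x0 > 0.
    """
    x = x0
    for _ in range(n_iters):
        x = 0.5 * (x + a / x)
    return x

estimate = heron_sqrt(2.0)  # converges to sqrt(2) ~ 1.41421356...
```

A recurrent network that applies the same learned transformation at every step is structurally analogous: iterating longer trades computation time for estimate quality, provided the iteration converges.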
In addition, a preference for an iterative solution may provide a useful inductive bias for certain computational tasks. However, it is unclear whether ResNets indeed learn solutions akin to iterative methods and, if they do, whether this is a useful inductive bias. The two defining features of iterative methods are (1) iteration and (2) convergence. Here we analyze to what extent these features emerge in ResNets. In order to investigate whether these features provide a useful inductive bias, we introduce two simple modifications of classical ResNets and study their impact on a number of datasets. First, we study CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and MNIST (LeCun et al., 2010) as examples of classical vision tasks, assessing the networks' performance and sample efficiency. Since iterative and convergent inductive biases may be more useful for tasks that require some degree of recurrence, we also assess the networks' performance and sample efficiency on several variations of Digitclutter, a task which requires the recognition of multiple digits that occlude each other (Spoerer et al., 2017). To study the effect of iteration, we manipulate the degree of weight sharing in ResNets, smoothly interpolating between ordinary and recurrent ResNets. We find that a higher degree of weight sharing tends to make the network more iterative, but does not result in improved performance or sample efficiency. This suggests that in ordinary ResNets, recurrent connections do not provide a useful inductive bias and the networks can harness the additional computational flexibility provided by non-recurrent residual blocks. Recurrence implies iteration, but not convergence, and so is not sufficient for a network to implement an iterative method as defined above. ResNets, whether they are recurrent (i.e. sharing weights across layers) or not, are therefore neither required nor encouraged to converge during training.
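One plausible way to interpolate between an ordinary and a recurrent ResNet, consistent with the soft gradient coupling described in the abstract, is to blend each residual block's gradient with the mean gradient across blocks of identical shape. The exact formulation used in the paper is not given in this excerpt, so the following is a hypothetical sketch with an assumed coupling coefficient `lam`:

```python
import numpy as np

def couple_gradients(grads, lam):
    """Soft gradient coupling (sketch, not the paper's exact method).

    grads: list of same-shaped gradient arrays, one per residual block.
    lam:   coupling coefficient in [0, 1].
           lam = 0 leaves every block's gradient untouched (ordinary ResNet);
           lam = 1 gives every block the identical mean gradient, so
           identically initialized blocks remain identical forever,
           i.e. an effectively recurrent (weight-shared) ResNet.
    """
    mean_grad = np.mean(grads, axis=0)
    return [(1.0 - lam) * g + lam * mean_grad for g in grads]
```

Intermediate values of `lam` regularize the blocks toward a shared transformation without enforcing exact weight sharing, which is the interpolation the text describes.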
We demonstrate empirically that ResNets in general do not exhibit convergent behavior and that recurrent ResNets are more convergent than non-recurrent networks. To study the effect of convergence, we upper bound the residual blocks' Lipschitz constant. This modification adversely impacts performance, suggesting that the non-convergent behavior in ordinary ResNets is not merely due to lack of incentive, but underpins the networks' high performance. Across convergent ResNets, a higher degree of weight sharing does not negatively affect performance. This suggests that convergent ResNets, in contrast to non-convergent ones, do not benefit from the increased computational flexibility of non-recurrent residual blocks. Taken together, our results suggest that an inductive bias favoring an iterative convergent solution does not outweigh the computational flexibility of non-recurrent residual blocks for the considered tasks.
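For a linear map x -> Wx, the Lipschitz constant equals the largest singular value (spectral norm) of W, so spectral normalization bounds the Lipschitz constant by rescaling W. A minimal NumPy sketch of this idea, using power iteration to estimate the spectral norm (the paper's actual implementation details are not given in this excerpt):

```python
import numpy as np

def spectral_norm(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    u = np.random.default_rng(seed).standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    # u and v approximate the top singular vectors, so u^T W v ~ sigma_max
    return u @ W @ v

def lipschitz_normalize(W, c=0.9):
    """Rescale W so the Lipschitz constant of x -> W x is at most c."""
    sigma = spectral_norm(W)
    return W * min(1.0, c / sigma)
```

Constraining each residual function f to have Lipschitz constant below 1 makes the block update x + f(x) a contraction-like refinement, which is what encourages convergence across depth.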

2. RELATED WORK

Prior theoretical work has focused on explaining the success of ResNets (He et al., 2016a) and the more general class of highway networks (Srivastava et al., 2015) by studying the learning dynamics in ResNets (Hochreiter, 1991; Orhan & Pitkow, 2018; Balduzzi et al., 2017) and their interpretation as an ensemble of shallow networks (Veit et al., 2016; Huang et al., 2018), as a discretized dynamical system (E, 2017; Haber & Ruthotto, 2018; E et al., 2019), and as performing iterative refinement.

The iterative refinement hypothesis. Our work builds on Jastrzebski et al. (2018), who argue that the sequential application of the residual blocks in a ResNet iteratively refines a representational estimate. Their work builds on observations that dropping out residual blocks, shuffling their order, or evaluating the last block several times retains reasonable performance (Veit et al., 2016; Greff et al., 2017) and can be used for training (Huang et al., 2016). Another set of methods uses such perturbations to train deep neural networks, using stochastic depth (Huang et al., 2016; Hu et al., 2019; Press et al., 2020). Other methods learn to evaluate a limited number of layers that depends on the input (Graves, 2017; Figurnov et al., 2017) or learn increasingly fine-grained object categories across layers (Zamir et al., 2017). Instead of using perturbations to encourage stability of the trained network, Ciccone et al. (2018) propose a method inspired by dynamical systems theory to guarantee such stability in their model.

Iterative refinement and inverse problems. The iterative refinement hypothesis is particularly important in the context of inverse problems, which are often solved using iterative methods. This is

