CORTICALLY MOTIVATED RECURRENCE ENABLES TASK EXTRAPOLATION

Abstract

Feedforward deep neural networks have become the standard class of models in computer vision. Yet, they differ strikingly from their biological counterparts, which predominantly perform "recurrent" computations. Why do biological neurons evolve to employ recurrence pervasively? In this paper, we show that a recurrent network can flexibly adapt its computational budget during inference and generalize within-task across difficulty levels. As part of this study, we contribute a recurrent module we call LocRNN, designed after a prior computational model of local recurrent intracortical connections in primates to support such dynamic task extrapolation. LocRNN learns highly accurate solutions to the two challenging visual reasoning problems we use here, Mazes and PathFinder. More importantly, it can flexibly run fewer or more recurrent iterations during inference to zero-shot generalize to less- and more-difficult instantiations of each task without requiring extra training data, a potential functional advantage of recurrence that biological visual systems capitalize on. Feedforward networks, on the other hand, with their fixed computational graphs, exhibit this trend only partially, potentially owing to image-level similarities across difficulty levels. We also posit an intriguing tradeoff between recurrent networks' representational capacity and their stability in the recurrent state space. Our work encourages further study of the role of recurrence in deep learning models, especially in the context of out-of-distribution generalization and task extrapolation, and of their task performance and stability properties.

1. INTRODUCTION

Deep learning based models for computer vision have recently matched and even surpassed human-level performance on various semantic tasks (Dosovitskiy et al., 2020; Vaswani et al., 2021; He et al., 2021; Liu et al., 2022). While the gap between human and machine task performance has been shrinking with more successful deep learning architectures, differences in their architectures and in critical behaviors, such as adversarial vulnerability (Athalye et al., 2018), texture bias (Geirhos et al., 2018), and lack of robustness to perceptual distortions (Hendrycks & Dietterich, 2019; Geirhos et al., 2019), have grown significantly. We are interested in one such stark architectural difference between artificial and biological vision that we believe underlies the above-mentioned critical behavioral differences: recurrent neural processing. While biological neurons predominantly process input stimuli with recurrence, existing high-performing deep learning architectures are largely feedforward in nature. In this work, we argue for further incorporation of recurrent processing in future deep learning architectures: we compare matched recurrent and feedforward networks to show that the former can extrapolate representations learned on a task to unseen difficulty levels without extra training data, whereas feedforward networks are strictly limited on this front because they cannot dynamically change their computational graph.

Ullman (1984) introduced a particularly important research direction in visual cognition that is of fundamental importance to understanding the human ability to extrapolate learned representations within-task across difficulties. Ullman hypothesized that all visual tasks we perform are supported by combinations of a small set of key elemental operations that are applied sequentially (analogous to recurrent processing), like an instruction set.
Instances of a task at varying difficulty levels can be solved by dynamically piecing together shorter or longer sequences of operations corresponding to that task. This approach of decomposing tasks into a sequence of elemental operations (a.k.a. visual routines) avoids the intractable need for vast amounts of training data representing every instantiation of a task and is hypothesized to support the human visual system's systematic generalization ability. Examples of elemental operations that compose visual routines include incremental contour grouping and curve tracing; a review of these operations along with physiological evidence can be found in Roelfsema et al. (2000). Can we develop artificial neural networks that also learn visual routines for a task when constrained to deliberately use specialized recurrent operations? In this work we show promising evidence that recurrent architectures can learn such general solutions to visual tasks and exhibit task extrapolation across difficulty levels. We note that such extrapolation is one kind of out-of-distribution generalization that standard feedforward deep learning models struggle to perform. For this study, we perform experiments on two challenging synthetic visual tasks, Mazes and PathFinder (discussed in Sec. 3). We make the following contributions as part of this work:

1. We show the advantage of recurrent processing over feedforward processing for task extrapolation. We present evidence of strong visual task extrapolation by specialized recurrent-convolutional architectures, including our proposed recurrent architecture, LocRNN, on challenging visual reasoning problems.

2. We contribute LocRNN, a biologically inspired recurrent convolutional architecture developed from Li (1998), a prior computational neuroscience model of recurrent processing in primate area V1. LocRNN introduces a connection type that is missing from commonly used deep neural networks: long-range lateral connections, which in cortex link neurons within the same layer (Bosking et al., 1997). We hypothesize that such lateral recurrent connections enable the learning and sequencing of elemental visual routines.

3. Comparing task performance alongside extrapolation performance across our recurrent architectures, we posit a potential tradeoff between task performance (the ability of recurrent networks to learn sophisticated iterative functions of their input that excel on downstream tasks) and stability in the state space (local smoothness of the trajectory of recurrent states through time). We show empirical evidence for an instance of this tradeoff and identify an important open problem that must be solved to establish high-performing and stable recurrent architectures for vision.

We combine cognitive insights from Ullman routines (Ullman, 1984) with a model of cortical recurrence (Li et al., 2006) from computational neuroscience to improve state-of-the-art recurrent neural network based machine learning architectures. This unique synergy demonstrates the superior ability of recurrent networks at task extrapolation through flexible adaptation of their test-time computational budget, a feat not possible for feedforward architectures. Our work encourages future research to further explore the role of recurrence in the design of deep learning architectures that behave like humans and generalize out-of-distribution.
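To make the core mechanism concrete, the sketch below illustrates the two ideas above in miniature: a recurrent cell whose hidden state is refined by a lateral (within-layer) convolution, and a rollout whose number of iterations can be chosen freely at inference time. This is a deliberately simplified toy, not the paper's LocRNN; the class name `LateralRecurrentCell`, the single-channel tanh update, and the 5x5 kernel are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, k):
    """2-D cross-correlation with zero 'same' padding (single channel)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

class LateralRecurrentCell:
    """Toy recurrent cell: the hidden state is refined by a lateral
    (within-layer) convolution, loosely mimicking long-range horizontal
    connections in cortex. Hypothetical sketch, not the paper's LocRNN."""
    def __init__(self, ksize=5, seed=0):
        rng = np.random.default_rng(seed)
        self.w_lat = rng.normal(0.0, 0.1, (ksize, ksize))  # lateral kernel
        self.w_in = 1.0                                     # input gain

    def step(self, h, x):
        # One elemental operation: mix the input with laterally
        # propagated context from the current hidden state.
        return np.tanh(self.w_in * x + conv2d_same(h, self.w_lat))

    def run(self, x, n_iter):
        # The same weights are unrolled for an arbitrary number of
        # iterations -- the test-time knob a feedforward net lacks.
        h = np.zeros_like(x)
        for _ in range(n_iter):
            h = self.step(h, x)
        return h
```

In this setting, an "easy" task instance might be answered after a few iterations (e.g. `cell.run(x, 5)`) while a "harder" one reuses the identical weights for a longer rollout (e.g. `cell.run(x, 20)`), which is the extrapolation-by-iteration behavior studied in this paper.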

2. RELATED WORK

As mentioned, our work is closely related to the visual routines literature introduced by Ullman (1984) and reviewed in detail by Roelfsema et al. (2000). The core idea of visual routines that makes them relevant to our question of task extrapolation is the flexible sequencing of elemental operations, resulting in a dynamic computational graph. This idea has been studied by prior research on using recurrent neural networks to solve sequential problems. Relatedly, there have been several attempts to learn how much recurrent computation to use for a given input sample on a given task (Graves, 2016; Saxton et al., 2019). Most relevant to our work is Schwarzschild et al. (2021), who evaluate the ability of recurrent neural networks to generalize from easier to harder problems. Our work extends and differs from their intriguing study of recurrence in task extrapolation in the following ways: 1) While their work explores sequential task extrapolation in general, with abstract problems such as prefix sums and chess puzzles, ours focuses specifically on extrapolation in visual task learning. Their maze segmentation task is therefore relevant to us, and we use it for evaluation (while also re-implementing the models used in their study). In addition, we present an evaluation on the PathFinder challenge (Linsley et al., 2018), a significantly more challenging and large-scale visual reasoning task whose design dates back to Jolicoeur et al. (1986).

