EMERGENT PROPERTIES OF FOVEATED PERCEPTUAL SYSTEMS

Abstract

We introduce foveated perceptual systems - a hybrid architecture inspired by human vision - to explore the role of a texture-based foveation stage in the nature and robustness of subsequently learned visual representations in machines. Specifically, these two-stage perceptual systems first foveate an image, inducing a texture-like encoding of peripheral information - mimicking the effects of visual crowding - which is then relayed through a convolutional neural network (CNN) trained to perform scene categorization. Through experiments that probe: 1) i.i.d. and o.o.d. generalization; 2) robustness to occlusion; 3) a center image bias; and 4) high spatial frequency sensitivity, we find that these foveated perceptual systems learn a visual representation distinct from that of their non-foveated counterparts. In addition, we examined the impact of this foveation transform against two additional models derived with a rate-distortion optimization procedure to compute matched-resource systems: a lower-resolution non-foveated system, and a foveated system with adaptive Gaussian blurring. The properties of greater i.i.d. generalization, high spatial frequency sensitivity, and robustness to occlusion emerged exclusively in our foveated texture-based models, independent of network architecture and learning dynamics. Altogether, these results demonstrate that foveation - via peripheral texture-based computations - yields a distinct and robust representational format for scene information relative to standard machine vision approaches, and they also provide symbiotic computational support for the hypothesis that texture-based peripheral encoding has important representational consequences for processing in the human visual system.

1. INTRODUCTION

In the human visual system, incoming light is sampled at different resolutions across the retina, in stark contrast to machines, which perceive images at uniform resolution. One account of this foveated (spatially varying) sampling array in humans rests purely on sensory efficiency (biophysical constraints) (Land & Nilsson, 2012; Eckstein, 2011): there is only a finite number of retinal ganglion cells (RGCs) that can relay information from the retina to the LGN, constrained by the flexibility and thickness of the optic nerve. Thus, given a limited number of photoreceptors, it is "more efficient" to have a moveable high-acuity fovea than a non-moveable uniform-resolution retina, as suggested in Akbas & Eckstein (2017). Machines, however, have no such wiring/resource constraints - and given their proven success in computer vision (LeCun et al., 2015) - this raises the question of whether a foveated inductive bias is necessary for vision at all.

However, it is also possible that foveation plays a functional role at the representational level, conferring perceptual advantages, as has been explored in humans. This idea has remained elusive in computer vision but popular in vision science, where it has been explored both psychophysically (Loschky et al., 2019) and computationally (Poggio et al., 2014; Cheung et al., 2017; Han et al., 2020). There are several symbiotic examples arguing for the functional advantages of foveation in humans via functional advantages in machine vision systems. For example, in the work of Pramod et al. (2018), blurring the image in the periphery increased the object recognition performance of computer vision systems by reducing their false positive rate. In Wu et al. (2018)'s GistNet, directly introducing a dual-stream foveal-peripheral pathway in a neural network boosted object detection performance via scene gist and contextual cueing.
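To make the notion of periphery-dependent blurring concrete, the following is a minimal sketch of an eccentricity-dependent (adaptive Gaussian) blur of the kind referenced above and used as a matched-resource control in the abstract. All function names, the linear blur-vs-eccentricity schedule, and the parameter values are illustrative assumptions for exposition, not the implementation used in this paper or in the cited works.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D Gaussian kernel truncated at ~3 sigma, normalized to sum to 1.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(image, sigma):
    # Separable Gaussian blur of a 2-D grayscale image (edge-replicated borders).
    if sigma <= 0:
        return image.copy()
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    out = np.pad(image, pad, mode="edge")
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, out)
    return out

def foveate(image, fixation, slope=0.05, n_levels=4, max_sigma=4.0):
    # Blend progressively blurred copies of the image so that blur strength
    # grows linearly with eccentricity (pixel distance from the fixation point).
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ecc = np.hypot(yy - fixation[0], xx - fixation[1])
    sigma_map = np.clip(slope * ecc, 0.0, max_sigma)
    levels = np.linspace(0.0, max_sigma, n_levels)
    pyramid = np.stack([blur(image, s) for s in levels])
    # Per-pixel linear interpolation between the two nearest blur levels.
    idx = sigma_map / max_sigma * (n_levels - 1)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, n_levels - 1)
    frac = idx - lo
    return (1 - frac) * pyramid[lo, yy, xx] + frac * pyramid[hi, yy, xx]
```

Note that this is only the blur-based control: it discards high spatial frequencies in the periphery outright, whereas the texture-based foveation studied here summarizes peripheral content with local texture statistics rather than simply low-passing it.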
Relatedly, the best-known work directly demonstrating the advantage of peripheral vision for scene processing in humans is that of Wang & Cottrell 1

