EMERGENT PROPERTIES OF FOVEATED PERCEPTUAL SYSTEMS

Abstract

We introduce foveated perceptual systems, a hybrid architecture inspired by human vision, to explore the role of a texture-based foveation stage in the nature and robustness of subsequently learned visual representations in machines. Specifically, these two-stage perceptual systems first foveate an image, inducing a texture-like encoding of peripheral information that mimics the effects of visual crowding, which is then relayed through a convolutional neural network (CNN) trained to perform scene categorization. We find that these foveated perceptual systems learn a visual representation distinct from that of their non-foveated counterparts, through experiments that probe: 1) i.i.d. and o.o.d. generalization; 2) robustness to occlusion; 3) a center image bias; and 4) high spatial frequency sensitivity. In addition, we examined the impact of this foveation transform with respect to two additional models derived with a rate-distortion optimization procedure to compute matched-resource systems: a lower-resolution non-foveated system, and a foveated system with adaptive Gaussian blurring. The properties of greater i.i.d. generalization, high spatial frequency sensitivity, and robustness to occlusion emerged exclusively in our foveated texture-based models, independent of network architecture and learning dynamics. Altogether, these results demonstrate that foveation, via peripheral texture-based computations, yields a distinct and robust representational format of scene information relative to standard machine vision approaches, and also provides symbiotic computational support for the view that texture-based peripheral encoding has important representational consequences for processing in the human visual system.

1. INTRODUCTION

In the human visual system, incoming light is sampled with different resolution across the retinal area, in stark contrast to machines, which perceive images at uniform resolution. One account for the nature of this foveated (spatially-varying) array in humans is related purely to sensory efficiency (biophysical constraints) (Land & Nilsson, 2012; Eckstein, 2011): e.g., there is only a finite number of retinal ganglion cells (RGCs) that can relay information from the retina to the LGN, constrained by the flexibility and thickness of the optic nerve. Thus, given a limited number of photoreceptors, it is "more efficient" to have a moveable high-acuity fovea rather than a non-moveable uniform-resolution retina, as suggested in Akbas & Eckstein (2017). Machines, however, do not have such wiring/resource constraints, and given their already proven success in computer vision (LeCun et al., 2015), this raises the question of whether a foveated inductive bias is necessary for vision at all. However, it is also possible that foveation plays a functional role at the representational level, which can confer perceptual advantages, as has been explored in humans. This idea has remained elusive in computer vision but popular in vision science, and has been explored both psychophysically (Loschky et al., 2019) and computationally (Poggio et al., 2014; Cheung et al., 2017; Han et al., 2020). There are several symbiotic examples arguing for the functional advantages of foveation in humans via functional advantages in machine vision systems. For example, in the work of Pramod et al. (2018), blurring the image in the periphery improved the object recognition performance of computer vision systems by reducing their false-positive rate. In Wu et al. (2018)'s GistNet, directly introducing a dual-stream foveal-peripheral pathway in a neural network boosted object detection performance via scene gist and contextual cueing.
Relatedly, the most well-known example of work that has directly shown the advantage of peripheral vision for scene processing in humans is Wang & Cottrell (2017)'s dual-stream CNN, which modelled the results of Larson & Loschky (2009) with a log-polar transform and adaptive Gaussian blurring (RGC convergence). Taken together, these studies present support for the functional hypothesis of a foveated visual system. Importantly, none of these studies introduced the notion of texture representation in the periphery, a key property of peripheral computation as posed in Rosenholtz (2016). Whether this texture-based coding of the visual periphery is functional in any perceptual system is still an open question. Here we address this question directly. Specifically, we introduce foveated perceptual systems: these are two-stage (hybrid) systems that have a texture-based foveation stage followed by a deep convolutional neural network. In particular, we mimic foveation over images using a transform that simulates visual crowding (Levi, 2011; Pelli, 2008; Doerig et al., 2019b;a) in the periphery, as shown in Figure 1 (Deza et al., 2019), rather than Gaussian blurring (Pramod et al., 2018; Wang & Cottrell, 2017) or compression (Patney et al., 2016; Kaplanyan et al., 2019). These rendered images capture image statistics akin to those preserved in human peripheral vision, resembling texture computation at the stage of area V2, as argued in Freeman & Simoncelli (2011); Rosenholtz (2016); Wallis et al. (2019).
Thus, our strategy in this paper is to compare these hybrid models' perceptual biases to those of their non-foveated counterparts through a set of experiments: generalization, robustness to occlusion, image-region bias, and spatial frequency sensitivity. A difference from Wang & Cottrell (2017) is that our goal is not to implement foveation with adaptive Gaussian blurring to fit known results to data (Larson & Loschky, 2009), but rather to explore the emergent consequences for scene representation following texture-based foveation. While it is certainly possible that in machine vision systems that only need to categorize scenes there may be little to no benefit from this texture-based computation, the logic of our approach is that any benefits or relevant differences between these systems can shed light on the importance of texture-based peripheral computation in humans, and could suggest a new inductive bias for advanced machine perception.
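To make the contrast with the adaptive Gaussian-blurring baseline concrete, the sketch below implements a simplified eccentricity-dependent blur: pixels farther from the fixation point are blurred more strongly. This is a rough illustration, not the RGC-convergence model of Wang & Cottrell (2017) nor the texture-based transform of Deza et al. (2019); the function name, the `sigma_per_pixel` scaling, and the ring-based blending are all assumptions made for clarity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_foveate(image, center, sigma_per_pixel=0.02, n_rings=6):
    """Simplified adaptive Gaussian-blur foveation for a grayscale image.

    Blur strength grows with eccentricity (distance from the fixation
    point `center`). A hypothetical stand-in, not any published model.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(ys - center[0], xs - center[1])  # eccentricity map (pixels)
    out = np.empty_like(image, dtype=float)
    # Partition the image into concentric eccentricity rings and blur each
    # ring with a sigma proportional to its mean eccentricity.
    edges = np.linspace(0.0, ecc.max() + 1e-6, n_rings + 1)
    for i in range(n_rings):
        sigma = sigma_per_pixel * 0.5 * (edges[i] + edges[i + 1])
        blurred = gaussian_filter(image.astype(float), sigma=sigma)
        mask = (ecc >= edges[i]) & (ecc < edges[i + 1])
        out[mask] = blurred[mask]
    return out
```

Note that near the fixation point the effective blur is negligible (high-acuity "fovea"), while the outermost ring receives the strongest smoothing; the discrete rings are a crude approximation to the smooth acuity falloff of the human retina.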

2. FOVEATED PERCEPTUAL SYSTEMS

We define perceptual systems as two-stage, with a foveation transform (stage 1, f(·): R^D → R^D) that is relayed to a deep convolutional neural network (stage 2, g(·): R^D → R^d). Note that the first-stage transform is a fixed operation over the input image, while the second stage has learnable parameters. In general, the perceptual system S(·), with retinal image input I ∈ R^D, is defined as: S(I) = g(f(I)). Such two-stage models have been growing in popularity, and the reason these models (including our own) are designed not to be fully end-to-end differentiable is mainly to force one type of computation into the first stage of a system, such that the second stage g(·) must figure out how to capitalize on this forced transformation, thus allowing us to assess the representational consequences of f(·). For example, Parthasarathy & Simoncelli (2020) successfully imposed V1-like computation in stage 1
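The composition S(I) = g(f(I)) can be sketched in PyTorch as follows, a minimal illustration under the assumption that stage 1 is any fixed image-to-image transform and stage 2 is a trainable CNN; the class name `PerceptualSystem` and the toy CNN are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerceptualSystem(nn.Module):
    """Two-stage system S(I) = g(f(I)): a fixed foveation transform f
    (stage 1, no learnable parameters) followed by a learnable CNN g
    (stage 2)."""

    def __init__(self, foveation_fn, cnn):
        super().__init__()
        self.foveation_fn = foveation_fn  # stage 1: fixed transform
        self.cnn = cnn                    # stage 2: trainable

    def forward(self, image):
        # Stage 1 is deliberately excluded from backpropagation, so the
        # system is not end-to-end differentiable: gradients only reach g.
        with torch.no_grad():
            foveated = self.foveation_fn(image)
        return self.cnn(foveated)

# Hypothetical stand-ins: identity "foveation" and a tiny 20-way classifier.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 20),
)
system = PerceptualSystem(lambda x: x, cnn)
logits = system(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 20])
```

In practice, stage 1 would be the foveated texture transform (or an adaptive Gaussian blur for the baseline), and only the parameters of `cnn` would be updated during scene-categorization training.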



Figure 1: A cartoon illustrating how a foveated image is rendered resembling a human visual metamer via the foveated feed-forward style transfer model of Deza et al. (2019). Here, each receptive field is locally perturbed with noise in its latent space in the direction of their equivalent texture representation (blue arrows) resulting in visual crowding effects in the periphery. These effects are most noticeable far away from the navy dot which is the simulated center of gaze (foveal region) of an observer.

