VISION AT A GLANCE: INTERPLAY BETWEEN FINE AND COARSE INFORMATION PROCESSING PATHWAYS

Abstract

Object recognition is often viewed as a feedforward, bottom-up process in machine learning, but in real neural systems it is a complicated process involving the interplay between two signal pathways. One is the parvocellular pathway (P-pathway), which is slow and extracts fine features of objects; the other is the magnocellular pathway (M-pathway), which is fast and extracts coarse features of objects. It has been suggested that the interplay between the two pathways endows the neural system with the capacity to process visual information rapidly, adaptively, and robustly. However, the underlying computational mechanism remains largely unknown. In this study, we build a two-pathway model to elucidate the computational properties associated with the interactions between the two visual pathways. The model consists of two convolutional neural networks: one mimics the P-pathway, referred to as FineNet, which is deep, has small kernels, and receives detailed visual inputs; the other mimics the M-pathway, referred to as CoarseNet, which is shallow, has large kernels, and receives blurred visual inputs. The two pathways interact with each other to facilitate information processing. Specifically, we show that CoarseNet can learn from FineNet through imitation to improve its performance considerably, and that feedback from CoarseNet improves the performance of FineNet and makes it robust to noise. Using visual backward masking as an example, we demonstrate that our model can explain visual cognitive behaviors that involve the interplay between the two pathways. We hope that this study will provide insight into understanding visual information processing and inspire the development of new object recognition architectures in machine learning.

1. INTRODUCTION

Imagine you are driving a car on a highway and suddenly an object appears in your visual field, crossing the road. Your initial reaction is to slam on the brakes even before recognizing the object. This highlights a core difference between human vision and current machine learning strategies for object recognition. In machine learning, visual object recognition is often viewed as a feedforward, bottom-up process, where object features are extracted from local to global in a hierarchical manner; whereas in human vision, we can capture the gist of a visual object at a glance without processing its details, a crucial ability for us (and especially animals) to survive in competitive natural environments. This strategic difference has been demonstrated by a large volume of experimental data. For example, Sugase et al. (1999) found that neurons in the inferior temporal cortex (IT) of macaque monkeys convey the coarse information of an object much faster than its fine information; fMRI and MEG studies on humans showed that the activation of the orbitofrontal cortex (OFC) precedes that of the temporal cortex when a blurred object is shown to the subject (Bar et al., 2006); Liu et al. (2017) further demonstrated that the dorsal pathway extracts the coarse information of an object in less than 100 ms after stimulus onset, and that this coarse information guides the subsequent local information processing. Indeed, the Reverse Hierarchy Theory of visual perception proposes that although the representation of image features along the ventral pathway goes from local to global, our perception of an object goes inversely from global to local (Hochstein & Ahissar, 2002). How does this happen in the brain? Experimental studies have revealed that there exist two anatomically and functionally separated signal pathways for visual information processing (see Fig. 1).
One is called the parvocellular pathway (P-pathway), which starts from midget retinal ganglion cells (MRGCs), projects to layers 3-6 of the lateral geniculate nucleus (LGN), and then primarily goes downstream along the ventral stream. The other is called the magnocellular pathway (M-pathway), which starts from parasol retinal ganglion cells (PRGCs), projects to layers 1-2 of the LGN, and then goes along the dorsal stream or the subcortical pathway (the superior colliculus and downstream areas). The two pathways have different neural response characteristics and complementary computational roles. Experimental findings have shown that the P-pathway is sensitive to colors and responds primarily to visual inputs of high spatial frequency, whereas the M-pathway is color blind and responds primarily to visual inputs of low spatial frequency (Derrington & Lennie, 1984). It has been suggested that the M-pathway serves as a shortcut to extract the coarse information of images rapidly, while the P-pathway extracts the fine features of images slowly, and that the interplay between the two pathways endows the neural system with the capacity to process visual information rapidly, adaptively, and robustly (Bar, 2003; Wang et al., 2020; Bullier, 2001; Liu et al., 2017). For instance, by extracting the coarse information of an image, the M-pathway can generate predictions about what to expect in the visual field, and this knowledge subsequently modulates the fine information processing in the P-pathway (Fig. 1). Although the existence of the separated P- and M-pathways is well known in the neuroscience field, exactly how they cooperate with each other to facilitate information processing remains poorly understood. In this study, we build a two-pathway model to elucidate the computational properties associated with the interplay between the two pathways (Fig. 2).
We use convolutional neural networks (CNNs) as the building blocks, since recent studies have revealed that CNNs are effective at modeling the neuronal response variability along the visual pathway (Yamins et al., 2013; Kriegeskorte, 2015). Specifically, we model the P-pathway using a relatively deep CNN, which has small kernels and receives detailed visual inputs, referred to as FineNet hereafter. The M-pathway is modeled by a relatively shallow CNN, which has large kernels and receives blurred visual inputs, referred to as CoarseNet hereafter. Based on the proposed model, we investigate several computational issues associated with the interplay between the two pathways, including how CoarseNet learns from FineNet via imitation, and how FineNet benefits from feedback from CoarseNet to improve its performance. We also use the two-pathway model to reproduce the backward masking phenomenon observed in human psychophysical experiments.
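The imitation mechanism described above can be read as a knowledge-distillation-style objective, in which CoarseNet (the student) matches the softened output distribution of FineNet (the teacher). The following minimal Python sketch illustrates one such loss; the temperature `T` and the cross-entropy form are our illustrative assumptions, not necessarily the paper's exact training objective.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; T > 1 softens the distribution.
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [x / s for x in e]

def imitation_loss(coarse_logits, fine_logits, T=2.0):
    # Cross-entropy between FineNet's softened outputs (teacher)
    # and CoarseNet's softened outputs (student). Minimized when
    # CoarseNet reproduces FineNet's output distribution.
    p = softmax(fine_logits, T)    # teacher: FineNet
    q = softmax(coarse_logits, T)  # student: CoarseNet
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

When the student's logits equal the teacher's, the loss reduces to the entropy of the teacher distribution, its minimum over all student outputs.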

2. THE TWO-PATHWAY MODEL

The structure of our two-pathway model is illustrated in Fig. 2, where FineNet and CoarseNet mimic the P- and M-pathways, respectively. Notably, FineNet is deeper than CoarseNet, reflecting that the P-pathway goes through more feature-analyzing relays (e.g., V1-V2-V4-IT along the ventral pathway) than the M-pathway. FineNet also has smaller convolutional kernels than CoarseNet, reflecting that MRGCs in the retina have much smaller receptive fields than PRGCs. Furthermore, we consider that FineNet receives detailed and colorful visual inputs, reflecting that MRGCs have small receptive fields and are color sensitive, while CoarseNet receives blurred and gray inputs, reflecting that PRGCs have large receptive fields and are color blind.
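The two input streams described above can be sketched as a simple preprocessing step: FineNet receives the image unchanged, while CoarseNet receives a grayscale, blurred version. The pure-Python sketch below uses BT.601 luma weights and a box filter for illustration; the exact blurring operation and filter width `k` used in the model are our assumptions.

```python
def to_gray(img):
    # img: H x W image of (r, g, b) tuples in [0, 1].
    # ITU-R BT.601 luma weights approximate perceived brightness.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in img]

def box_blur(gray, k=3):
    # Box filter of odd width k with edge clamping: each output
    # pixel is the mean of its k x k neighborhood.
    h, w = len(gray), len(gray[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    s += gray[ii][jj]
            out[i][j] = s / (k * k)
    return out

def coarse_input(img, k=3):
    # CoarseNet input: color-blind (gray) and low spatial
    # frequency (blurred). FineNet receives `img` unchanged.
    return box_blur(to_gray(img), k)
```

In practice this step would be applied to each image before it enters the two networks, so that both pathways see the same scene at different levels of detail.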



Figure 1: Illustration of the two separated pathways for information processing in the visual system. An image of an eagle is processed through two pathways. Upper panel: the P-pathway processes the detailed information of the image. Lower panel: the M-pathway processes the coarse information of the image rapidly, generates predictions about the image (association), and modulates the information processing of the P-pathway (feedback). MRGC: midget retinal ganglion cell. PRGC: parasol retinal ganglion cell. EVA: early visual area. LOC: lateral occipital complex. IPS: intraparietal sulcus. SC: superior colliculus. PFC: prefrontal cortex.

