VISION AT A GLANCE: INTERPLAY BETWEEN FINE AND COARSE INFORMATION PROCESSING PATHWAYS

Abstract

Object recognition is often viewed as a feedforward, bottom-up process in machine learning, but in real neural systems, object recognition is a complex process involving the interplay between two signal pathways. One is the parvocellular pathway (P-pathway), which is slow and extracts fine features of objects; the other is the magnocellular pathway (M-pathway), which is fast and extracts coarse features of objects. It has been suggested that the interplay between the two pathways endows the neural system with the capacity to process visual information rapidly, adaptively, and robustly. However, the underlying computational mechanism remains largely unknown. In this study, we build a two-pathway model to elucidate the computational properties associated with the interactions between the two visual pathways. The model consists of two convolutional neural networks: one mimics the P-pathway, referred to as FineNet, which is deep, has small-size kernels, and receives detailed visual inputs; the other mimics the M-pathway, referred to as CoarseNet, which is shallow, has large-size kernels, and receives blurred visual inputs. The two pathways interact with each other to facilitate information processing. Specifically, we show that CoarseNet can learn from FineNet through imitation to improve its performance considerably, and that through feedback from CoarseNet, the performance of FineNet is improved and becomes robust to noise. Using visual backward masking as an example, we demonstrate that our model can explain visual cognitive behaviors that involve the interplay between the two pathways. We hope that this study will provide insight into understanding visual information processing and inspire the development of new object recognition architectures in machine learning.
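The structural contrast between the two pathways can be illustrated with a minimal sketch. This is not the paper's actual architecture: the kernel sizes, depths, blur width, and the plain-NumPy implementation are all illustrative assumptions, chosen only to show how a deep, small-kernel network on detailed input differs from a shallow, large-kernel network on blurred input.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2-D cross-correlation of image x with kernel k."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def blur(x, size=4):
    """Box blur: a stand-in for the coarse (M-pathway-like) input."""
    box = np.ones((size, size)) / size ** 2
    return conv2d(x, box)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))  # toy grayscale image

# FineNet-like pathway: deep (3 layers), small 3x3 kernels, detailed input.
fine = img
for _ in range(3):
    fine = np.maximum(conv2d(fine, rng.standard_normal((3, 3))), 0)  # ReLU

# CoarseNet-like pathway: shallow (1 layer), large 7x7 kernel, blurred input.
coarse = np.maximum(conv2d(blur(img), rng.standard_normal((7, 7))), 0)

print(fine.shape)    # (26, 26): 32 shrinks by 2 per 3x3 layer
print(coarse.shape)  # (23, 23): 4x4 blur -> 29, then 29 - 6
```

The sketch omits the interactions the paper actually studies (imitation learning by CoarseNet and feedback to FineNet); it only fixes the asymmetry in depth, kernel size, and input resolution that defines the two pathways.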

1. INTRODUCTION

Imagine you are driving a car on a highway and suddenly an object appears in your visual field, crossing the road. Your initial reaction is to slam on the brakes even before recognizing the object. This highlights a core difference between human vision and current machine learning strategies for object recognition. In machine learning, visual object recognition is often viewed as a feedforward, bottom-up process, where object features are extracted from local to global in a hierarchical manner; whereas in human vision, we can capture the gist of a visual object at a glance without processing its details, a crucial ability for humans (and especially other animals) to survive in competitive natural environments. This strategic difference has been demonstrated by a large volume of experimental data. For example, Sugase et al. (1999) found that neurons in the inferior temporal cortex (IT) of macaque monkeys convey the coarse information of an object much faster than its fine information; fMRI and MEG studies on humans showed that the activation of the orbitofrontal cortex (OFC) precedes that of the temporal cortex when a blurred object is shown to the subject (Bar et al., 2006); Liu et al. (2017) further demonstrated that the dorsal pathway extracts the coarse information of an object in less than 100 ms after stimulus onset, and that this coarse information guides the subsequent local information processing. Indeed, the Reverse Hierarchy Theory of visual perception has proposed that although the representation of image features along the ventral pathway goes from local to global, our perception of an object goes inversely from global to local (Hochstein & Ahissar, 2002). How does this happen in the brain? Experimental studies have revealed that there exist two anatomically and functionally separated signal pathways for visual information processing (see Fig. 1). One is called the parvocellular

