UNSUPERVISED LEARNING OF FEATURES AND OBJECT BOUNDARIES FROM LOCAL PREDICTION

Abstract

The human visual system has to learn both which features to extract from images and how to group locations into (proto-)objects. Those two aspects are usually dealt with separately, although predictability is discussed as a cue for both. To incorporate features and boundaries into the same model, we model a retinotopic visual cortex with a pairwise Markov random field model in which each factor is paired with an additional binary variable, which switches the factor on or off. Using one of two contrastive learning objectives, we can learn both the features and the parameters of the Markov random field factors from images without further supervision signals. The features learned by shallow neural networks based on this loss are local averages, opponent colors, and Gabor-like stripe patterns as observed in early human visual cortices. Furthermore, we can infer connectivity between locations by inferring the switch variables. Contours inferred from this connectivity perform quite well on the Berkeley segmentation database (BSDS500) without any training on contours. Thus, optimizing predictions across space aids both segmentation and feature learning, and models trained this way show similarities to the human visual system. We speculate that retinotopic visual cortex might implement such predictions over space through lateral connections.
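The abstract's core mechanism is a pairwise Markov random field in which every factor is gated by a binary switch variable, and connectivity between locations is read out by inferring those switches. A minimal sketch of this idea is given below; the quadratic factor form, the coupling matrix `W`, and the off-state constant `c_off` are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: a pairwise factor between neighboring feature
# vectors f_i, f_j, gated by a binary switch s in {0, 1}. When s = 1
# the factor couples the two locations (prediction across space "on");
# when s = 0 the factor is replaced by a constant log-cost c_off.
# The quadratic form and c_off are assumptions for illustration only.

def pairwise_log_factor(f_i, f_j, W, c_off):
    """Log-factor value for each switch state [s=0, s=1]."""
    coupled = -0.5 * (f_i - f_j) @ W @ (f_i - f_j)  # s = 1: features predict each other
    return np.array([c_off, coupled])               # s = 0: fixed cost, no coupling

def infer_switch(f_i, f_j, W, c_off):
    """Posterior probability that the switch is on, i.e. the two
    locations belong to the same (proto-)object."""
    log_f = pairwise_log_factor(f_i, f_j, W, c_off)
    p = np.exp(log_f - log_f.max())  # softmax over the two switch states
    p /= p.sum()
    return p[1]

f_i = rng.normal(size=4)
W = np.eye(4)
p_same = infer_switch(f_i, f_i + 0.01, W, c_off=-2.0)  # similar neighbors
p_diff = infer_switch(f_i, f_i + 5.0, W, c_off=-2.0)   # dissimilar neighbors
assert p_same > p_diff  # similar features -> switch more likely on
```

Under this reading, locations whose features predict each other well keep their switch on, while poorly predicted neighbors turn it off; contours then emerge wherever switches along an image row or column are inferred to be off.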

1. INTRODUCTION

A long-standing question about human vision is how representations that are initially based on parallel processing of retinotopic feature maps can come to represent objects in a useful way. Most research on this topic has focused on computing later, object-centered representations from the feature-map representations. Psychology and neuroscience identified features that lead to objects being grouped together (Koffka, 1935; Köhler, 1967), established feature integration into coherent objects as a sequential process (Treisman & Gelade, 1980), and developed solutions to the binding problem, i.e. ways in which neurons could signal whether they represent parts of the same object (Finger & König, 2014; Peter et al., 2019; Singer & Gray, 1995; Treisman, 1996). In computer vision, researchers also focused on how feature-map representations could be turned into segmentations and object masks. Classically, segmentation algorithms were clustering algorithms operating on extracted feature spaces (Arbeláez et al., 2011; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004; Shi & Malik, 2000), and this approach is still explored with more complex mixture models today (Vacher et al., 2022). Since the advent of deep neural network models, the focus has shifted towards models that map directly to contour maps or semantic segmentation maps (Girshick et al., 2014; He et al., 2019; Kokkinos, 2016; Liu et al., 2017; Shen et al., 2015; Xie & Tu, 2015), as reviewed by Minaee et al. (2021). Diverse findings suggest that processing within the feature maps takes object boundaries into account. For example, neurons appear to encode border ownership (Jeurissen et al., 2013; Peter et al., 2019; Self et al., 2019) and to fill in information across surfaces (Komatsu, 2006) and along illusory contours (Grosof et al., 1993; von der Heydt et al., 1984).
Also, attention spreading through the feature maps seems to respect object boundaries (Baldauf & Desimone, 2014; Roelfsema et al., 1998). And selecting neurons that correspond to an object takes time, which scales with the distance between the points to be compared (Jeurissen et al., 2016; Korjoukov et al., 2012). Finally, a long history of psychophysical studies showed that changes in spatial frequency and orientation content can define (texture) boundaries (e.g. Beck et al., 1987; Landy & Bergen, 1991; Wolfson & Landy, 1995). In

