UNSUPERVISED LEARNING OF FEATURES AND OBJECT BOUNDARIES FROM LOCAL PREDICTION

Abstract

The human visual system has to learn both which features to extract from images and how to group locations into (proto-)objects. Those two aspects are usually dealt with separately, although predictability is discussed as a cue for both. To incorporate features and boundaries into the same model, we model a retinotopic visual cortex with a pairwise Markov random field model in which each factor is paired with an additional binary variable, which switches the factor on or off. Using one of two contrastive learning objectives, we can learn both the features and the parameters of the Markov random field factors from images without further supervision signals. The features learned by shallow neural networks based on this loss are local averages, opponent colors, and Gabor-like stripe patterns as observed in early human visual cortices. Furthermore, we can infer connectivity between locations by inferring the switch variables. Contours inferred from this connectivity perform quite well on the Berkeley segmentation database (BSDS500) without any training on contours. Thus, optimizing predictions across space aids both segmentation and feature learning, and models trained this way show similarities to the human visual system. We speculate that retinotopic visual cortex might implement such predictions over space through lateral connections.

1. INTRODUCTION

A long-standing question about human vision is how representations that are initially based on parallel processing of retinotopic feature maps can come to represent objects in a useful way. Most research on this topic has focused on computing later object-centered representations from the feature map representations. Psychology and neuroscience identified features that lead to objects being grouped together (Koffka, 1935; Köhler, 1967), established feature integration into coherent objects as a sequential process (Treisman & Gelade, 1980), and developed solutions to the binding problem, i.e. ways in which neurons could signal whether they represent parts of the same object (Finger & König, 2014; Peter et al., 2019; Singer & Gray, 1995; Treisman, 1996). In computer vision, researchers likewise focused on how feature map representations could be turned into segmentations and object masks. Classically, segmentation algorithms were clustering algorithms operating on extracted feature spaces (Arbeláez et al., 2011; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004; Shi & Malik, 2000), and this approach is still explored with more complex mixture models today (Vacher et al., 2022). Since the advent of deep neural network models, the focus has shifted towards models that map directly to contour maps or semantic segmentation maps (Girshick et al., 2014; He et al., 2019; Kokkinos, 2016; Liu et al., 2017; Shen et al., 2015; Xie & Tu, 2015), as reviewed by Minaee et al. (2021). Diverse findings suggest that processing within the feature maps takes object boundaries into account. For example, neurons appear to encode border ownership (Jeurissen et al., 2013; Peter et al., 2019; Self et al., 2019) and to fill in information across surfaces (Komatsu, 2006) and along illusory contours (Grosof et al., 1993; von der Heydt et al., 1984).
Also, attention spreading through the feature maps seems to respect object boundaries (Baldauf & Desimone, 2014; Roelfsema et al., 1998). And selecting the neurons that correspond to an object takes time, which scales with the distance between the points to be compared (Jeurissen et al., 2016; Korjoukov et al., 2012). Finally, a long history of psychophysical studies showed that changes in spatial frequency and orientation content can define (texture) boundaries (e.g. Beck et al., 1987; Landy & Bergen, 1991; Wolfson & Landy, 1995). In both human vision and computer vision, relatively little attention has been given to these effects of grouping or segmentation on the feature maps themselves. Additionally, most theories of grouping and segmentation take the features in the original feature maps as given. In human vision, these features are traditionally chosen by the experimenter (Koffka, 1935; Treisman & Gelade, 1980; Treisman, 1996) or are inferred based on other research (Peter et al., 2019; Self et al., 2019). Similarly, computer vision algorithms originally used off-the-shelf feature banks (Arbeláez et al., 2011; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004; Shi & Malik, 2000), and have recently moved towards deep neural network representations trained for other tasks as a source for feature maps (Girshick et al., 2014; He et al., 2019; Kokkinos, 2016; Liu et al., 2017; Shen et al., 2015; Xie & Tu, 2015). Interestingly, predictability of visual inputs over space and time has been discussed as a solution for both these limitations of earlier theories. Predictability has been used as a cue for segmentation since the law of common fate of Gestalt psychology (Koffka, 1935), and both lateral interactions in visual cortices and contour integration respect the statistics of natural scenes (Geisler & Perry, 2009; Geisler et al., 2001).
Among other signals like sparsity (Olshausen & Field, 1996) or reconstruction (Kingma & Welling, 2014), predictability is also a well-known signal for self-supervised learning of features (Wiskott & Sejnowski, 2002), which has been exploited by many recent contrastive learning (e.g. Feichtenhofer et al., 2021; Gutmann & Hyvarinen, 2010; Hénaff et al., 2020; van den Oord et al., 2019) and predictive coding schemes (e.g. Lotter et al., 2017; 2018; van den Oord et al., 2019). However, these uses of predictability for feature learning and for segmentation are usually studied separately. Here, we propose a model that learns both features and segmentation without supervision. Predictions between locations provide a self-supervised loss for learning the features, how to perform the prediction, and how to infer which locations should be grouped. This view also combines contrastive learning (Gutmann & Hyvarinen, 2010; van den Oord et al., 2019), a Markov random field model for the feature maps (Li, 2012), and segmentation into a coherent framework. We implement our model using shallow architectures. The learned features resemble early cortical responses, and the object boundaries we infer from predictability align well with human object contour reports from the Berkeley segmentation database (BSDS500; Arbeláez et al., 2011). Thus, retinotopic visual cortex might implement computational principles similar to those we propose here.

2. MODEL

To explain our combined model of feature maps and their local segmentation information, we start with a Gaussian Markov random field model (Li, 2012) with pairwise factors. We then add a variable w ∈ {0, 1} to each factor that governs whether the factor enters the product or not. This yields a joint distribution for the whole feature map and all w's. Marginalizing out the w's yields a Markov random field with "robust" factors for the feature map, which we can use to predict feature vectors from the vectors at neighboring positions. We find two contrastive losses based on these predictions that can be used to optimize the feature extraction and the factors in the Markov random field model. We model the distribution of k-dimensional feature maps f ∈ ℝ^(k×m'×n') that are computed from input images I ∈ ℝ^(c×m×n) with c = 3 color channels (see Fig. 1 A & B). We use a Markov random field model with pairwise factors, i.e. we define the probability of encountering a feature map f with entries f_i at locations i ∈ [1 . . . m'] × [1 . . . n'] as follows:

p(f) ∝ ∏_i ψ_i(f_i) ∏_{(i,j)∈N} ψ_ij(f_i, f_j),

where ψ_i is the local factor, N is the set of all neighboring pairs, and ψ_ij is the pairwise factor between positions i and j¹. We will additionally assume shift invariance, i.e. each point has the same set of nearby relative positions in the map as neighbors, ψ_i is the same factor for each position, and each factor ψ_ij depends only on the relative position of i and j.

¹ i and j thus have two entries each
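The structure above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the factors are learned in the actual model, whereas here the isotropic Gaussian forms, the 4-neighborhood, and all parameter values (pi_on, sigma) are illustrative assumptions.

```python
import numpy as np


def robust_pair_factor(fi, fj, pi_on=0.7, sigma=1.0):
    """Pairwise factor with the binary switch w marginalized out:
    sum_w p(w) psi(fi, fj)^w = pi_on * psi(fi, fj) + (1 - pi_on)."""
    psi = np.exp(-0.5 * np.sum((fi - fj) ** 2) / sigma**2)
    return pi_on * psi + (1.0 - pi_on)


def switch_posterior(fi, fj, pi_on=0.7, sigma=1.0):
    """Posterior p(w = 1 | fi, fj): the probability that the factor is
    'on', i.e. that the two locations are grouped together."""
    psi = np.exp(-0.5 * np.sum((fi - fj) ** 2) / sigma**2)
    return pi_on * psi / (pi_on * psi + (1.0 - pi_on))


def mrf_log_prob_unnorm(f, sigma_local=1.0, sigma_pair=1.0):
    """Unnormalized log p(f) for a feature map f of shape (k, m', n'):
    log prod_i psi_i(f_i) + log prod_{(i,j) in N} psi_ij(f_i, f_j),
    with shift-invariant Gaussian factors over a 4-neighborhood."""
    # Local factors psi_i: an isotropic Gaussian on each feature vector.
    log_p = -0.5 * np.sum(f**2) / sigma_local**2
    # Pairwise factors psi_ij: penalize squared differences between
    # vertically (axis 1) and horizontally (axis 2) adjacent locations;
    # shift invariance means one factor per relative offset.
    for axis in (1, 2):
        diff = np.diff(f, axis=axis)
        log_p += -0.5 * np.sum(diff**2) / sigma_pair**2
    return log_p
```

For identical neighboring feature vectors the marginalized factor equals 1 and the switch posterior reduces to the prior pi_on; as the feature difference grows, the posterior falls toward 0, which is the kind of connectivity signal from which boundaries can be read out.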