UNSUPERVISED LEARNING OF FEATURES AND OBJECT BOUNDARIES FROM LOCAL PREDICTION

Abstract

The human visual system has to learn both which features to extract from images and how to group locations into (proto-)objects. Those two aspects are usually dealt with separately, although predictability is discussed as a cue for both. To incorporate features and boundaries into the same model, we model a retinotopic visual cortex with a pairwise Markov random field model in which each factor is paired with an additional binary variable, which switches the factor on or off. Using one of two contrastive learning objectives, we can learn both the features and the parameters of the Markov random field factors from images without further supervision signals. The features learned by shallow neural networks based on this loss are local averages, opponent colors, and Gabor-like stripe patterns as observed in early human visual cortices. Furthermore, we can infer connectivity between locations by inferring the switch variables. Contours inferred from this connectivity perform quite well on the Berkeley segmentation database (BSDS500) without any training on contours. Thus, optimizing predictions across space aids both segmentation and feature learning, and models trained this way show similarities to the human visual system. We speculate that retinotopic visual cortex might implement such predictions over space through lateral connections.

1. INTRODUCTION

A long-standing question about human vision is how representations that are initially based on parallel processing of retinotopic feature maps can come to represent objects in a useful way. Most research on this topic has focused on computing later, object-centered representations from the feature map representations. Psychology and neuroscience identified features that lead to objects being grouped together (Koffka, 1935; Köhler, 1967), established feature integration into coherent objects as a sequential process (Treisman & Gelade, 1980), and developed solutions to the binding problem, i.e. ways in which neurons could signal whether they represent parts of the same object (Finger & König, 2014; Peter et al., 2019; Singer & Gray, 1995; Treisman, 1996). In computer vision, researchers also focused on how feature map representations could be turned into segmentations and object masks. Classically, segmentation algorithms were clustering algorithms operating on extracted feature spaces (Arbeláez et al., 2011; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004; Shi & Malik, 2000), and this approach is still explored with more complex mixture models today (Vacher et al., 2022). Since the advent of deep neural network models, the focus has shifted towards models that directly map to contour maps or semantic segmentation maps (Girshick et al., 2014; He et al., 2019; Kokkinos, 2016; Liu et al., 2017; Shen et al., 2015; Xie & Tu, 2015), as reviewed by Minaee et al. (2021). Diverse findings suggest that processing within the feature maps takes object boundaries into account. For example, neurons appear to encode border ownership (Jeurissen et al., 2013; Peter et al., 2019; Self et al., 2019) and to fill in information across surfaces (Komatsu, 2006) and along illusory contours (Grosof et al., 1993; von der Heydt et al., 1984).
Also, attention spreading through the feature maps seems to respect object boundaries (Baldauf & Desimone, 2014; Roelfsema et al., 1998). And selecting neurons that correspond to an object takes time, which scales with the distance between the points to be compared (Jeurissen et al., 2016; Korjoukov et al., 2012). Finally, a long history of psychophysical studies showed that changes in spatial frequency and orientation content can define (texture) boundaries (e.g. Beck et al., 1987; Landy & Bergen, 1991; Wolfson & Landy, 1995). In both human vision and computer vision, relatively little attention has been given to these effects of grouping or segmentation on the feature maps themselves. Additionally, most theories of grouping and segmentation take the features in the original feature maps as given. In human vision, these features are traditionally chosen by the experimenter (Koffka, 1935; Treisman & Gelade, 1980; Treisman, 1996) or are inferred based on other research (Peter et al., 2019; Self et al., 2019). Similarly, computer vision algorithms originally used off-the-shelf feature banks (Arbeláez et al., 2011; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004; Shi & Malik, 2000), and have recently moved towards deep neural network representations trained for other tasks as a source for feature maps (Girshick et al., 2014; He et al., 2019; Kokkinos, 2016; Liu et al., 2017; Shen et al., 2015; Xie & Tu, 2015). Interestingly, predictability of visual inputs over space and time has been discussed as a solution for both these limitations of earlier theories. Predictability has been used as a cue for segmentation since the law of common fate of Gestalt psychology (Koffka, 1935), and both lateral interactions in visual cortices and contour integration respect the statistics of natural scenes (Geisler & Perry, 2009; Geisler et al., 2001).
Among other signals like sparsity (Olshausen & Field, 1996) or reconstruction (Kingma & Welling, 2014), predictability is also a well-known signal for self-supervised learning of features (Wiskott & Sejnowski, 2002), which has been exploited by many recent contrastive learning (e.g. Feichtenhofer et al., 2021; Gutmann & Hyvarinen, 2010; Hénaff et al., 2020; van den Oord et al., 2019) and predictive coding schemes (e.g. Lotter et al., 2017; 2018; van den Oord et al., 2019) for self-supervised learning. However, these uses of predictability for feature learning and for segmentation are usually studied separately. Here, we propose a model that learns both features and segmentation without supervision. Predictions between locations provide a self-supervised loss for learning the features, how to perform the prediction, and how to infer which locations should be grouped. Also, this view combines contrastive learning (Gutmann & Hyvarinen, 2010; van den Oord et al., 2019), a Markov random field model for the feature maps (Li, 2012), and segmentation into a coherent framework. We implement our model using shallow architectures. The learned features resemble early cortical responses, and the object boundaries we infer from predictability align well with human object contour reports from the Berkeley segmentation database (BSDS500; Arbeláez et al., 2011). Thus, retinotopic visual cortex might implement similar computational principles as we propose here.

2. MODEL

To explain our combined model of feature maps and their local segmentation information, we start with a Gaussian Markov random field model (Li, 2012) with pairwise factors. We then add a variable w ∈ {0, 1} to each factor that governs whether the factor enters the product or not. This yields a joint distribution for the whole feature map and all w's. Marginalizing out the w's yields a Markov random field with "robust" factors for the feature map, which we can use to predict feature vectors from the vectors at neighboring positions. We find two contrastive losses based on these predictions that can be used to optimize the feature extraction and the factors in the Markov random field model.

We model the distribution of k-dimensional feature maps f ∈ R^(k × m′ × n′) that are computed from input images I ∈ R^(c × m × n) with c = 3 color channels (see Fig. 1 A & B). We use a Markov random field model with pairwise factors, i.e. we define the probability of encountering a feature map f with entries f_i at locations i ∈ [1 . . . m′] × [1 . . . n′] as follows:

p(f) ∝ ∏_i ψ_i(f_i) ∏_{(i,j)∈N} ψ_ij(f_i, f_j),   (1)

where ψ_i is the local factor, N is the set of all neighboring pairs, and ψ_ij is the pairwise factor between positions i and j.foot_0 We additionally assume shift invariance, i.e. each point has the same set of nearby relative positions in the map as neighbors, ψ_i is the same factor for each position, and each factor ψ_ij depends only on the relative position of i and j.

We now add a binary variable w ∈ {0, 1} to each pairwise factor that encodes whether the factor is 'active' (w = 1) for that particular image (Fig. 1 C). To scale the probabilities of w = 1 and w = 0 relative to each other, we add a factor that scales them with constants p_ij ∈ [0, 1] and 1 − p_ij respectively:

p(f, w) ∝ ∏_i ψ_i(f_i) ∏_{(i,j)∈N} p_ij^(w_ij) (1 − p_ij)^(1 − w_ij) ψ_ij(f_i, f_j)^(w_ij)   (2)

Finally, we assume that the factors are Gaussian and that the feature vectors are originally normalized to have mean 0 and variance 1:

p(f, w) = (1 / Z_0) N(f; 0, I) ∏_{(i,j)∈N} p_ij^(w_ij) (1 − p_ij)^(1 − w_ij) Z(w_ij, C_ij) exp( −(w_ij / 2) (f_i − f_j)^T C_ij (f_i − f_j) ),   (3)

where Z_0 is the overall normalization constant, N(f; 0, I) is the density of a standard normal distribution with k × m′ × n′ dimensions, C_ij governs the strength of the coupling in the form of a precision matrix, which we will assume to be diagonal, and Z(w_ij, C_ij) scales the distributions with w_ij = 0 and w_ij = 1 relative to each other. We set Z(w_ij, C_ij) to the normalization constant of the Gaussian that combines the pairwise factor with standard Gaussian factors for f_i and f_j. For w = 0 this is just (2π)^(−k), the normalization constant of a standard Gaussian in 2k dimensions. For w = 1 we get:

Z(w_ij = 1, C_ij)^(−1) = ∫∫ exp( −½ f_i^T f_i − ½ f_j^T f_j − ½ (f_i − f_j)^T C_ij (f_i − f_j) ) df_i df_j   (4)

Z(w_ij = 1, C_ij) = (2π)^(−k) det( [[I + C_ij, −C_ij], [−C_ij, I + C_ij]] )^(1/2)   (5)

= (2π)^(−k) ∏_l √(1 + 2 c_ll),   (6)

which we get by computing the normalization constant of a Gaussian with the given precision matrix and then using the assumption that C_ij is a diagonal matrix with diagonal entries c_ll.
This normalization depends only on w and the coupling matrix C of the factor ψ_ij and thus induces a valid probability distribution on the feature maps. Two points are notable about this normalization though: First, once other factors also constrain f_i and/or f_j, this normalization no longer guarantees p(w_ij = 1) = p_ij.foot_1 Second, the w_ij are not independent in the resulting distribution. For example, if pairwise factors connect a to b, b to c, and a to c, the corresponding w are dependent, because w_ab = 1 and w_bc = 1 already imply a smaller difference between f_a and f_c than if these factors were inactive, which increases the probability of w_ac = 1.
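The step from the block-matrix determinant to the product over diagonal entries uses the identity det([[I+C, −C], [−C, I+C]]) = det((I+C)² − C²) = det(I + 2C). A short NumPy check (ours, not part of the paper's code) confirms that the two expressions for Z(w_ij = 1, C_ij) agree:

```python
import numpy as np

def Z_w1_closed_form(c_diag):
    # Eq. (6): Z(w_ij = 1, C_ij) = (2*pi)^(-k) * prod_l sqrt(1 + 2 c_ll)
    c = np.asarray(c_diag, dtype=float)
    k = c.size
    return (2 * np.pi) ** (-k) * np.prod(np.sqrt(1 + 2 * c))

def Z_w1_block_det(c_diag):
    # Eq. (5): (2*pi)^(-k) * det([[I + C, -C], [-C, I + C]])^(1/2)
    c = np.asarray(c_diag, dtype=float)
    k = c.size
    C, I = np.diag(c), np.eye(k)
    P = np.block([[I + C, -C], [-C, I + C]])  # 2k x 2k joint precision matrix
    return (2 * np.pi) ** (-k) * np.sqrt(np.linalg.det(P))
```

Because C_ij is diagonal, the closed form avoids building the 2k × 2k matrix altogether.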

2.1. LEARNING

To learn our model from data, we use a contrastive learning objective on the marginal likelihood p(f). To do so, we first need to marginalize out the w's, which is fortunately simple, because each w affects only a single factor:

p(f) = Σ_w p(f, w) = (1 / Z_0) N(f; 0, I) ∏_{(i,j)∈N} [ p_ij ψ_ij(f_i, f_j) + (1 − p_ij) ]   (7)

Using this marginal likelihood directly for fitting is infeasible though, because computing Z_0, i.e. normalizing this distribution, is not computationally tractable. We resort to contrastive learning to fit the unnormalized probability distribution (Gutmann & Hyvarinen, 2010), i.e. we optimize discrimination from a noise distribution with the same support as the target distribution. Following van den Oord et al. (2019), we do not optimize the Markov random field directly, but optimize predictions based on the model using features from other locations as the noise distribution. For this noise distribution, the factors that depend only on a single location (the first product in (1)) cancel. We thus ignore the N(f; 0, I) in our optimization and instead normalize the feature maps to mean 0 and unit variance across each image. We define two alternative losses that make predictions for positions based on all their neighbors or for a single factor respectively.
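For intuition, the marginalized "robust" pairwise factor p_ij ψ_ij(f_i, f_j) + (1 − p_ij) is best evaluated in log space. The sketch below is ours and assumes ψ_ij carries the relative scaling Z(1, C_ij)/Z(0, C_ij) = ∏_l √(1 + 2 c_ll) derived above:

```python
import numpy as np

def log_robust_factor(f_i, f_j, c_diag, p_ij):
    """Log of one marginalized factor: log[p_ij * psi_ij(f_i, f_j) + (1 - p_ij)].

    Assumption (ours): psi_ij includes the relative scaling Z(1, C)/Z(0, C),
    so log psi_ij = 0.5 * sum(log(1 + 2 c)) - 0.5 * (f_i - f_j)^T C (f_i - f_j)
    for diagonal C with entries c_diag."""
    d = np.asarray(f_i) - np.asarray(f_j)
    c = np.asarray(c_diag)
    log_psi = 0.5 * np.sum(np.log1p(2 * c)) - 0.5 * np.sum(c * d * d)
    # logaddexp evaluates log(exp(a) + exp(b)) without overflow
    return np.logaddexp(np.log(p_ij) + log_psi, np.log1p(-p_ij))
```

As p_ij → 0 the factor approaches a constant 1 (no coupling), and as p_ij → 1 it approaches the plain Gaussian factor; similar feature vectors always receive a higher value than dissimilar ones.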

2.1.1. POSITION LOSS

The position loss optimizes the probability of the feature vector at each location relative to the probability of randomly chosen other feature vectors from different locations and images:

l_pos(f) = Σ_i log [ p(f_i | f_j ∀ j ∈ N(i)) / Σ_{i′} p(f_{i′} | f_j ∀ j ∈ N(i)) ]   (8)

= Σ_i Σ_{j∈N(i)} log ψ_ij(f_i, f_j) − Σ_i log [ Σ_{i′} exp( Σ_{j∈N(i)} log ψ_ij(f_{i′}, f_j) ) ],   (9)

where N(i) is the set of neighbors of i.
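A minimal NumPy sketch of one summand of the position loss, using a stable log-sum-exp. Whether the true location is counted among the candidates i′ in the denominator is our assumption (standard in InfoNCE-style objectives); the text leaves this detail open:

```python
import numpy as np

def position_loss_at_i(log_psi_pos, log_psi_neg):
    """One summand of eqs. (8)/(9), as a sketch (names are ours).

    log_psi_pos: log psi_ij(f_i, f_j) for the true f_i over its neighbors j,
                 shape (n_neighbors,)
    log_psi_neg: the same for candidate vectors f_i' from other locations/images,
                 shape (n_candidates, n_neighbors)"""
    # score of each candidate = sum of log factors over the neighborhood
    scores = np.concatenate([[np.sum(log_psi_pos)], np.sum(log_psi_neg, axis=1)])
    m = scores.max()
    log_denom = m + np.log(np.sum(np.exp(scores - m)))  # stable log-sum-exp over i'
    return scores[0] - log_denom
```

The summand is maximal (close to 0) when the true feature vector fits its neighborhood much better than the negatives.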

2.1.2. FACTOR LOSS

The factor loss instead maximizes each individual factor for the correct feature vectors relative to random pairs of feature vectors sampled from different locations and images:

l_fact = Σ_{i,j} log [ ψ_ij(f_i, f_j) / Σ_{i′,j′} ψ_ij(f_{i′}, f_{j′}) ]   (10)

= Σ_{i,j} log ψ_ij(f_i, f_j) − Σ_{i,j} log Σ_{i′,j′} ψ_ij(f_{i′}, f_{j′}),   (11)

where i, j index the correct locations and i′, j′ index randomly drawn locations, in our implementation generated by shuffling the feature maps and taking all pairs that occur in these shuffled maps.
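The factor loss with shuffled negatives can be sketched for a single relative offset as follows; the array layout and the single shared denominator per offset are our reading of the text, not the paper's code:

```python
import numpy as np

def factor_loss(f, offset, c_diag, rng):
    """Sketch of eqs. (10)/(11) for one relative offset on one feature map.

    f: (k, m, n) feature map; offset: (dy, dx) of the neighbor relation;
    c_diag: diagonal of C_ij for this offset. Negative pairs come from a
    spatially shuffled copy of the map, as described in the text."""
    k, m, n = f.shape
    dy, dx = offset

    def log_psi(a):
        fi, fj = a[:, :m - dy, :n - dx], a[:, dy:, dx:]
        d = fi - fj
        # -0.5 * (f_i - f_j)^T C (f_i - f_j) at every location, diagonal C
        return -0.5 * np.einsum('l,lij->ij', np.asarray(c_diag), d * d)

    pos = log_psi(f)
    # shuffle locations (flatten, permute, reshape) to form negative pairs
    shuffled = f.reshape(k, -1)[:, rng.permutation(m * n)].reshape(k, m, n)
    neg = log_psi(shuffled)
    # eq. (11): every positive pair shares the same log-sum over negative pairs
    mx = neg.max()
    log_denom = mx + np.log(np.sum(np.exp(neg - mx)))
    return pos.sum() - pos.size * log_denom
```

For a constant feature map all log ψ terms vanish, so the loss reduces to −n_pairs · log(n_pairs), which makes the sketch easy to check.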

2.1.3. OPTIMIZATION

We optimize all weights of the neural network used for feature extraction and the parameters of the random field, i.e. the C_ij and p_ij for the different relative spatial positions, simultaneously. As an optimization algorithm, we use stochastic gradient descent with momentum. Both losses succeed in learning the model, but the factor loss is substantially more efficient. We discuss the distinction between the two losses and further details of the optimization in the supplementary materials.
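Concretely, the setup described in the supplement (SGD with momentum and weight decay, with a 10× higher learning rate for the random field parameters C and p) can be sketched with PyTorch parameter groups. The variable names and shapes here are ours, not the paper's code:

```python
import torch

# Hypothetical stand-ins for the trained quantities; names and shapes are ours.
feature_net = torch.nn.Conv2d(3, 50, kernel_size=11)  # feature extraction weights
logit_p = torch.nn.Parameter(torch.zeros(4))          # logit of p_ij per relative offset
log_c = torch.nn.Parameter(torch.zeros(4, 50))        # log of diag(C_ij) per offset

# SGD with momentum and slight weight decay; the prediction parameters
# (C and p) get a 10x higher learning rate, as in the supplement.
optimizer = torch.optim.SGD(
    [
        {"params": feature_net.parameters()},
        {"params": [logit_p, log_c], "lr": 1e-2},
    ],
    lr=1e-3,
    momentum=0.9,
    weight_decay=1e-4,
)
```

Optimizing the logit of p and the log of the diagonal of C keeps both parameters in their valid ranges without explicit constraints.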

2.2. SEGMENTATION INFERENCE

Computing the probability for any individual pair of locations (i, j) to be connected, i.e. computing p(w_ij = 1 | f), depends only on the two connected feature vectors f_i and f_j:

p(w_ij = 1 | f) / p(w_ij = 0 | f) = [ p_ij / (1 − p_ij) ] [ Z(w_ij = 1, C_ij) / Z(w_ij = 0, C_ij) ] exp( −½ (f_i − f_j)^T C_ij (f_i − f_j) )

This inference effectively yields a connectivity measure for each pair of neighboring locations, i.e. a sparse connectivity matrix. Given that we did not apply any prior information enforcing continuous objects or contours, the inferred w_ij do not necessarily correspond to a valid segmentation or set of contours. Finding the best-fitting contours or segmentation for given probabilities of the w's is an additional process, which in humans appears to be an attention-dependent serial process (Jeurissen et al., 2016; Self et al., 2019). To evaluate the detected boundaries on computer vision benchmarks, we nonetheless need to convert the extracted connectivity matrix into a contour image. To do so, we use the spectral-clustering-based globalization method developed by Arbeláez et al. (2011). This method requires that all connection weights between nodes are positive. To achieve this, we transform the log-probability ratios for the w_ij as follows: For each image, we find the 30% quantile of the values, subtract it from all log-probability ratios, and set all values below 0.01 to 0.01. We then compute the smallest eigenvectors of the graph Laplacian as in graph spectral clustering. These eigenvectors are then transformed back into image space and are filtered with simple edge detectors to find the final contours.
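Both the per-pair log-odds and the transform to positive affinities are simple enough to sketch in NumPy; the ½ in the exponent follows our reconstruction of the Gaussian factor, and the function names are ours:

```python
import numpy as np

def log_odds_connected(f_i, f_j, c_diag, p_ij):
    """log[p(w_ij = 1 | f) / p(w_ij = 0 | f)] for one pair of locations,
    using Z(1, C)/Z(0, C) = prod_l sqrt(1 + 2 c_ll) from eq. (6)."""
    d = np.asarray(f_i) - np.asarray(f_j)
    c = np.asarray(c_diag)
    return (np.log(p_ij) - np.log1p(-p_ij)
            + 0.5 * np.sum(np.log1p(2 * c))
            - 0.5 * np.sum(c * d * d))

def positive_affinities(log_odds, quantile=0.3, floor=0.01):
    """Per-image transform required by the globalization step: subtract the
    30% quantile and clip from below so all connection weights are positive."""
    shifted = np.asarray(log_odds) - np.quantile(log_odds, quantile)
    return np.maximum(shifted, floor)
```

Similar feature vectors thus yield higher log-odds of being connected, and the clipped affinities can be fed directly into the graph Laplacian of the spectral globalization step.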

3. EVALUATION

We implement 3 model types with feature extractions of increasing complexity in PyTorch (Paszke et al., 2019):

Pixel value model. For illustrative purposes, we first apply our ideas to the RGB pixel values of an image as features. This provides us with an example where we can easily show the feature values and connections. Additionally, this model provides an easy benchmark for all evaluations.

Linear model. As the simplest kind of model that allows learning features, we use a single convolutional deep neural network layer as our feature model. Here, we use 50 11 × 11 linear features.

Predseg1. To show that our methods work for more complex architectures with non-linearities, we use a relatively small deep neural network with 4 layers (2 convolutional layers and 2 residual blocks with subsampling layers between them, see supplement for details).

For each of these architectures, we train 24 different networks with all combinations of the following settings: 4 different sizes of neighborhoods (4, 8, 12, or 20 neighbors, see Fig. 1D), 3 different noise levels (0, 0.1, 0.2), and the two learning objectives. As a training set, we used the unlabeled image set from MS COCO (Lin et al., 2015), which contains 123,404 color images of varying resolution. To enable batch processing, we randomly crop these images to 256 × 256 pixel resolution, but use no other data augmentation (see supplementary information for further training details).

We want to evaluate whether our models learn human-like features and segmentations. To do so, we first analyze the features in the first layers of our networks, where we can judge whether features are representative of biological visual systems. We then extract segmentations from our activations and evaluate those on the Berkeley Segmentation Dataset (BSDS500; Arbeláez et al., 2011).

3.1. LEARNED FEATURES

Linear Model

We first analyze the weights in our linear models (Fig. 2 A-C). All instances learn local averages and Gabor-like striped features, i.e. spatial frequency- and orientation-tuned features with limited spatial extent. These features clearly resemble receptive fields of neurons in primary visual cortex. Additionally, there appears to be some preference for features that weight the red and green color channels much more strongly than the blue channel, similar to the human luminance channel, which leads to the yellow-blue contrasts in the plots. There is some difference between the two learning objectives though. The position-based loss generally leads to lower-frequency and somewhat noisier features. This could either be due to the higher learning efficiency of the factor-based loss, i.e. the factor-based loss is closer to convergence, or due to a genuinely different optimization goal.

Predseg1

In Predseg1, we first analyze the layer 0 convolution (Fig. 2D), which has only 3 channels with 3 × 3 receptive fields and which we originally introduced as a learnable downsampling. This layer consistently converges to applying near-constant weights over space. Additionally, exactly one of the channels has a non-zero mean (the 3rd, 1st, and 3rd in Fig. 2D) and the other two take balanced differences between two of the channels (red vs. green and green vs. blue in the examples). This parallels the luminance and opponent color channels of human visual perception. In the second convolution, we observe a similar pattern of oriented filters and local averages as in the linear model, albeit in false color, as the input channels are rotated by the weighting of the layer 0 convolution (Fig. 2 E & F).

3.2. CONTOUR EXTRACTION

To evaluate whether the connectivity information extracted by our model corresponds to human-perceived segmentation, we extract contours from our models and compare them to contours reported by humans for the Berkeley Segmentation Database (Arbeláez et al., 2011; Martin et al., 2001). This database contains human-drawn object boundaries for 500 natural images and is accompanied by methods for evaluating segmentation models. Using the methods provided with the database, we compute precision-recall curves for each model and use the best F-value (harmonic mean of precision and recall) as the final evaluation metric. As we had multiple models to choose from, we chose the models from each class that performed best on the training data for our reports. For all models this was one of the models with the largest neighborhood, i.e. using 20 neighbors, and the factor loss. It seems the factor loss performed better simply due to its technical efficiency advantage, as discussed above. Performance increases monotonically with neighborhood size, and Markov random field-based approaches to semantic segmentation also increased their performance with larger neighborhoods, up to fully connected Markov random fields (Krähenbühl & Koltun, 2012; Chen et al., 2014; 2017). We thus expect that larger neighborhoods could work even better. Qualitatively, we observe that all our models yield sensible contour maps (see Fig. 3 A). Additionally, we note that the linear model and layer 1 of the Predseg1 model tend to produce double contours, i.e. they tend to produce two contours on either side of the contour reported by human subjects, with some area between them connected to neither side of the contour. Quantitatively, our models also perform well, except for the deeper layers of Predseg1 (Fig. 3B and Table 1).
The other models beat most hand-crafted contour detection algorithms that were tested on this benchmark (Canny, 1986; Comaniciu & Meer, 2002; Cour et al., 2005; Felzenszwalb & Huttenlocher, 2004) and perform close to the gPb-owt-ucm contour detection and segmentation algorithm (Arbeláez et al., 2011) that was the state of the art at the time. Layer 0 of Predseg1 performs best, followed by the linear feature model and finally the pixel value model. Interestingly, the best performing models seem to be mostly the local averaging models (cf. Fig. 2 C). In particular, the high performance of the first layer of Predseg1 is surprising, because it uses only 3 × 3 pixel local color averages as features. Since the advent of deep neural network models, networks trained to optimize performance on image segmentation have reached much higher performance on the BSDS500 benchmark, essentially reaching perfect performance up to human inconsistency (e.g. He et al., 2019; Kokkinos, 2016; Linsley et al., 2020; Liu et al., 2017; Shen et al., 2015; Su et al., 2021; Xie & Tu, 2015, see Table 1). However, these models all require direct training on human-reported contours and often use features learned for other tasks. There are also a few deep neural network models that attempt unsupervised segmentation (e.g. Chen et al., 2019; Lin et al., 2021; Xia & Kulis, 2017), but we were unable to find any that were evaluated on the contour task of BSDS500. The closest is perhaps the W-net (Xia & Kulis, 2017), which used an autoencoder structure with additional constraints and was evaluated on the segmentation task of BSDS500, performing slightly better than gPb-owt-ucm.

4. DISCUSSION

We present a model that can learn features and local segmentation information from images without further supervision signals. This model integrates the prediction task used for feature learning and the segmentation task into the same coherent probabilistic framework. This framework and the dual use of the connectivity information make it seem sensible to represent this information. Furthermore, the features learned by our models resemble receptive fields in the retina and primary visual cortex, and the contours we extract from connectivity information match contours drawn by human subjects fairly well, both without any training towards making them more human-like. Towards biological plausibility, all computations in our model are local, and all units are connected to the same small, local set of other units throughout learning and inference, which matches early visual cortex, in which the lateral connections that follow natural image statistics are implemented anatomically (Buzás et al., 2006; Hunt et al., 2011; Roelfsema et al., 1998; Stettler et al., 2002). This is in contrast to other ideas that require flexible pointers to arbitrary locations and features (as discussed by Shadlen & Movshon, 1999) or capsules that flexibly encode different parts of the input (Doerig et al., 2020; Kosiorek et al., 2019; Sabour et al., 2017; 2021). Nonetheless, we employ contrastive learning objectives and backpropagation here, for which we do not provide a biologically plausible implementation. However, there is currently active research towards biologically plausible alternatives to these algorithms (e.g. Illing et al., 2021; Xiong et al., 2020). Selecting the neurons that react to a specific object appears to rely on some central resource (Treisman, 1996; Treisman & Gelade, 1980) and to spread gradually through the feature maps (Jeurissen et al., 2013; 2016; Self et al., 2019).
We used a computer vision algorithm for this step, which centrally computes the eigenvectors of the connectivity graph Laplacian (Arbeláez et al., 2011), which does not immediately look biologically plausible. However, a recent theory for hippocampal place and grid cells suggests that these cells compute the same eigenvectors of a graph Laplacian, albeit of a successor representation (Stachenfeld et al., 2014; 2017). Thus, this might be an abstract description of an operation brains are capable of. In particular, earlier accounts that model the selection as a marker that spreads to related locations (e.g. Finger & König, 2014; Roelfsema, 2006; Singer & Gray, 1995) have some similarities with iterative algorithms to compute eigenvectors. Originally, phase coherence was proposed as a marker (Finger & König, 2014; Peter et al., 2019; Singer & Gray, 1995), but a simple gain increase within attended objects (Roelfsema, 2006) and a random gain modulation (Haimerl et al., 2019; 2021) were also proposed. Regardless of the mechanistic implementation of the marker, connectivity information of the type our model extracts would be extremely helpful to explain the gradual spread of object selection. Our implementation of the model is not fully optimized, as it is meant as a proof of concept. In particular, we did not optimize the architectures or training parameters of our networks for the task, like initialization, optimization algorithm, learning rate, or regularization. Presumably, better performance in all benchmarks could be reached by adjusting any or all of these parameters. One possible next step for our model would be to train deeper architectures, such that the features could be used for complex tasks like object detection and classification. Contrastive losses like the one we use here are successfully applied in pretraining for large-scale tasks such as ImageNet (Russakovsky et al., 2015) or MS COCO (Lin et al., 2015).
These large-scale applications often require modifications for better learning (Chen et al., 2020; Feichtenhofer et al., 2021; Grill et al., 2020; He et al., 2020; Hénaff et al., 2020; van den Oord et al., 2019). For example: image augmentations that explicitly train networks to be invariant to some image changes, prediction heads that allow more complex distributions for the predictions, and memory banks or other methods that decrease the reliance on many negative samples. For understanding human vision, this line of reasoning opens the exciting possibility that higher visual cortex could be explained based on similar principles, as representations from contrastive learning also yield high predictive power for these cortices (Zhuang et al., 2021). The model we propose here is a probabilistic model of the feature maps. Based on this model, we could also infer the feature values. Thus, our model implies a pattern for how neurons should combine their bottom-up inputs with predictions from nearby other neurons, once we include some uncertainty for the bottom-up inputs. In particular, the combination ought to take into account which nearby neurons react to the same object and which ones do not. Investigating this pooling could provide insights and predictions for phenomena that are related to local averaging. Crowding, for example (Balas et al., 2009; Freeman & Simoncelli, 2011; Herzog et al., 2015; Wallis et al., 2016; 2017; 2019), is currently captured best by summary statistic models (Balas et al., 2009; Freeman & Simoncelli, 2011; Wallis et al., 2017), but deviations from these predictions suggest that object boundaries change processing (Herzog et al., 2015; Wallis et al., 2016; 2019).
Another promising extension of our model would be processing over time, because predictions over time were found to be a potent signal for contrastive learning (Feichtenhofer et al., 2021) and because coherent object motion is among the strongest grouping signals for human observers (Köhler, 1967) and computer vision systems (Yang et al., 2021). Besides the substantial increase in processing capacity necessary to move to video processing instead of image processing, this step would require some extension of our framework to include object motion in the prediction. Nonetheless, including processing over time seems to be an interesting avenue for future research, especially because segmentation annotations for video are extremely expensive to collect, such that unsupervised learning is particularly advantageous and popular in recent approaches (Araslanov et al., 2021; Jabri et al., 2020; Lai et al., 2020).



foot_0: i and j thus have two entries each.

foot_1: Instead, p(w_ij = 1) will be higher, because other factors increase the precision for the feature vectors, which makes the normalization constants more similar.



Figure 1: Illustration of our Markov random field model for the feature maps. A: An example input image. B: Feature map with 4 neighborhood connectivity and pixel color as the extracted feature.In the actual models, these feature maps are higher dimensional maps extracted by a convolutional neural network. C: Illustration of the factor that links the feature vectors at two neighboring locations for a 1D feature. Top row: projection of the factor ψ ij onto the difference between the features value f i -f j , showing the combination of a Gaussian around 0 and a constant function for the connection variable w ij being 1 or 0 respectively. Middle row: 2D representation of the factor and its parts plotted against both feature values. Bottom row: Multiplication of the middle row with the standard normal factor for each position yielding the joint distribution of two isolated positions. D: Neighborhoods of different sizes used in the models, scaling from 4 to 20 neighbors for each location.

Figure 2: Example linear filter weights learned by our models. Each individual filter is normalized to a minimum of 0 and a maximum of 1. As weights can be negative, even a zero weight can lead to a pixel having some brightness. For example, a number of channels load similarly on red and green across positions. Where these weights are positive the filter appears yellow, and where the weights are negative the filter appears blue, even if the blue channel has a zero weight. A-C: Feature weights learned by the linear model. A: Using the position loss. B: Using the factor loss. C: The weights of the model that leads to the best segmentation performance, i.e. the one shown in Figure 3. D: Weights of the first convolution in Predseg1. Next to the filter shapes, which are nearly constant, we plot the average weight of each channel onto the three color channels of the image. E: Predseg1 filters in the second convolution for a network trained with the position-based loss. F: Predseg1 filters in the second convolution for a network trained with the factor-based loss.

Figure 3: Contour detection results. A: Example segmentations from our models. B: Precision-recall curves for our models on the Berkeley segmentation dataset, with some other models for comparison as evaluated by Arbeláez et al. (2011): gPb-owt-ucm, the final algorithm combining all improvements (Arbeláez et al., 2011), Canny's classical edge detector (Canny, 1986), the mean shift algorithm (Comaniciu & Meer, 2002), Felzenszwalb's algorithm (Felzenszwalb & Huttenlocher, 2004), and segmentation based on normalized cuts (Cour et al., 2005). For all comparison algorithms, evaluations on BSDS were extracted from the figure by Arbeláez et al. (2011).

Numerical evaluation of various algorithms on the BSDS500 dataset. Precision and recall are only given for ODS, i.e. with the threshold fixed across the whole dataset. Evaluations of these algorithms are taken from Arbeláez et al. (2011).

A SUPPLEMENTARY MATERIAL: TRAINING DETAILS

We trained 24 networks of each of the three types. The versions differed in the size of the neighborhood (4, 8, 12, or 20 neighbors), the amount of noise added (α ∈ {0, 0.1, 0.2}), and the loss used (position or factor loss). The parameters we trained were:

• all weights of the underlying network
• the logit transform of p for each relative position of two neighbors
• the logarithms of the diagonal entries of C for each relative position of neighbors

We trained models using the standard stochastic gradient descent implemented in PyTorch (Paszke et al., 2019) with a learning rate of 0.001, a momentum of 0.9, and a slight weight decay of 0.0001. To speed up convergence, we increased the learning rate by a factor of 10 for the parameters of the prediction, i.e. C and p. For the gradient accumulation for the position-based loss, we accumulated 5 repetitions for the pixel model and 10 for the linear model and for Predseg1. Each repetition contained 10 random negative locations. Batch size was set to fit onto the smaller GPU type used in our local cluster. The resulting sizes are listed in Table 2.

A.1 ARCHITECTURE DETAILS

The pixel model was implemented as a single identity layer. The linear model was implemented as a single 50 × 11 × 11 convolutional layer. The Predseg1 model was implemented as a sequential model with 4 processing steps separated by subsampling layers (1 × 1 convolutional layers with a stride > 1). The first processing step was a 3 × 3 convolutional layer with 3 channels followed by subsampling by a factor of 3. The second step was an 11 × 11 convolutional layer with 64 features followed by subsampling by a factor of 2. The third and fourth steps were residual processing blocks, i.e. two convolutional layers with a rectified linear unit non-linearity between them, whose results were added to the inputs. They had 128 and 256 features respectively and were separated by another subsampling by a factor of 2.
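A minimal PyTorch sketch of the Predseg1 architecture described above. The residual-block kernel size, padding choices, and the placement of the channel increases (here in the strided 1 × 1 subsampling convolutions) are our assumptions, as they are not fully specified in the text:

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block: two convolutions with a ReLU in between,
    whose output is added to the input. Kernel size is an assumption."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))


class Predseg1(nn.Module):
    """Sketch of Predseg1: 4 processing steps separated by strided
    1x1 convolutions that implement the subsampling."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 3, 3, padding=1),     # step 1: 3x3 conv, 3 channels
            nn.Conv2d(3, 3, 1, stride=3),      # subsample by 3
            nn.Conv2d(3, 64, 11, padding=5),   # step 2: 11x11 conv, 64 features
            nn.Conv2d(64, 128, 1, stride=2),   # subsample by 2 (channels to 128)
            ResBlock(128),                     # step 3: residual block, 128 features
            nn.Conv2d(128, 256, 1, stride=2),  # subsample by 2 (channels to 256)
            ResBlock(256),                     # step 4: residual block, 256 features
        )

    def forward(self, x):
        return self.net(x)


# a 96x96 input is subsampled by 3, 2, and 2, giving an 8x8 feature map
features = Predseg1()(torch.randn(1, 3, 96, 96))
```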

A.2 ADDED NOISE

To prevent individual feature dimensions from becoming perfectly predictive, we added a small amount of Gaussian noise to the feature maps before applying the loss. To yield variables with mean 0 and variance 1 after adding the noise, we implemented this step as

f ← √(1 − α) · f + √α · ϵ,

where α ∈ [0, 1] controls the noise variance and ϵ is a standard normal random variable. Adding this noise did not change any of our results substantially, and the three versions with different amounts of noise (α = 0, 0.1, or 0.2) performed within 1-2% in all performance metrics.
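The mixing rule above can be sketched in a few lines; `add_noise` is our name for the operation, and the variance-preserving property assumes the input features are already standardized:

```python
import torch


def add_noise(features: torch.Tensor, alpha: float) -> torch.Tensor:
    """Mix standardized features with Gaussian noise so that the result
    keeps mean 0 and variance 1 (assuming the input already has them):
    f <- sqrt(1 - alpha) * f + sqrt(alpha) * eps."""
    eps = torch.randn_like(features)
    return (1.0 - alpha) ** 0.5 * features + alpha ** 0.5 * eps


torch.manual_seed(0)
f = torch.randn(100_000)            # standardized toy feature dimension
noisy = add_noise(f, alpha=0.2)     # variance stays close to 1
```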

A.3 TRAINING DURATION

Networks were trained in jobs that were limited to either 48 hours of computation time or 10 epochs of training. As listed in Table 2, we used a single such job for the pixel models, 7 for the linear models, and 9 for the Predseg1 models. Most larger networks were limited by the 48-hour limit, not by the epoch limit.

A.4 USED COMPUTATIONAL RESOURCES

The vast majority of the computation time was used for training the network parameters. Computing segmentations for the BSDS500 images and evaluating them took only a few hours of pure CPU processing. Networks were trained on an internal cluster using one GPU at a time and 6 CPUs for data loading. We list the training time per epoch in Table 2. If every job had run for the full 48 hours, we would have used (1 + 7 + 9) × 24 × 2 = 816 days of GPU processing time, which is a relatively close upper bound on the time we actually used.

Under review as a conference paper at ICLR 2023

A.5 COMPARISON OF THE TWO LOSSES

The position loss is consistent with the prediction made by the whole Markov random field, but is relatively inefficient, because the predicted distribution p(f_i | f_j ∀ j ∈ N(i)) and the normalization constants for these conditional distributions are different for every location i. Thus, the second term in equation (9) cannot be reused across the locations i. Instead, we need to compute the second term for each location separately, which requires a similar amount of memory as the whole feature representation for each negative sample i′ and each neighbor. To enable a sufficiently large set of negative points i′ with the available memory, we compute this loss multiple times with few negative samples and sum the gradients. This trick saves memory, because we can free the memory for the loss computation after each repetition. As the initial computation of the feature maps is the same for all negative samples, we save some computation by running the feature maps only once. To propagate the gradients through this single computation, we add up the gradients of the loss repetitions with respect to the feature maps and then propagate this summed gradient through the feature map computation. This procedure does not save computation time compared to the loss with many negative samples, as we still need to evaluate each position and each sample in the normalization set.

The factor loss does not lead to a consistent estimation of the MRF model, because the prediction p(f_i | f_j) should not be based only on the factor ψ_ij, but should include indirect effects, as f_j also constrains the other neighbors of i. Optimizing each factor separately will thus over-account for information that could be implemented in two factors.
However, the factor loss has the distinct advantage that the same noise evaluations can be used for all positions and images in a minibatch, which enables a much larger number of noise samples and thus much faster learning.
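The memory-saving accumulation scheme for the position loss can be sketched as follows. The feature map is computed once, each repetition's gradient is taken with respect to a detached copy, and the summed gradient is sent through the network in a single backward pass. The names `feature_net`, `position_loss`, and the toy loss are our illustrative stand-ins, not the paper's actual implementation:

```python
import torch
import torch.nn as nn


def accumulate_position_loss(feature_net, image, position_loss,
                             n_repetitions=5, n_negatives=10):
    """Run several small-negative-sample loss repetitions against a single
    feature map, summing gradients w.r.t. the map, then backpropagate the
    sum through the feature computation in one pass."""
    features = feature_net(image)                       # single forward pass
    detached = features.detach().requires_grad_(True)   # cut the graph here

    grad_sum = torch.zeros_like(detached)
    total_loss = 0.0
    for _ in range(n_repetitions):
        # each repetition draws a fresh, small set of negative locations;
        # its intermediate buffers can be freed after this grad call
        loss = position_loss(detached, n_negatives)
        grad, = torch.autograd.grad(loss, detached)
        grad_sum += grad
        total_loss += loss.item()

    # one backward pass through the feature network with the summed gradient
    features.backward(grad_sum)
    return total_loss / n_repetitions


# toy demonstration with a hypothetical contrastive-style loss
net = nn.Conv2d(3, 8, 3, padding=1)
img = torch.randn(1, 3, 16, 16)


def toy_loss(f, n_neg):
    # stand-in: contrast the mean response against random negative locations
    idx = torch.randint(f.shape[2] * f.shape[3], (n_neg,))
    neg = f.flatten(2)[:, :, idx]
    return (f.mean() - neg.mean()) ** 2


accumulate_position_loss(net, img, toy_loss)
```

Only one repetition's computation graph is alive at a time, so peak memory scales with the small per-repetition negative set rather than with the full set of negative samples.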

