E-CRF: EMBEDDED CONDITIONAL RANDOM FIELD FOR BOUNDARY-CAUSED CLASS WEIGHTS CONFUSION IN SEMANTIC SEGMENTATION

Abstract

Modern semantic segmentation methods devote much effort to adjusting image feature representations to improve segmentation performance in various ways, such as architecture design, attention mechanisms, etc. However, almost all of these methods neglect the particularity of the class weights (in the classification layer) of segmentation models. In this paper, we observe that the class weights of categories that share many adjacent boundary pixels lack discrimination, thereby limiting performance. We call this issue Boundary-caused Class Weights Confusion (BCWC). We focus on this problem and propose a novel method named Embedded Conditional Random Field (E-CRF) to alleviate it. E-CRF innovatively fuses the CRF into the CNN network as an organic whole for more effective end-to-end optimization. The reasons are twofold. First, it utilizes the CRF to guide message passing between pixels in high-level features, purifying the feature representations of boundary pixels with the help of inner pixels belonging to the same object. More importantly, it enables optimizing class weights in both scale and direction during backpropagation, which we prove with a detailed theoretical analysis. Besides, superpixels are integrated into E-CRF and serve as an auxiliary to exploit the local object prior for more reliable message passing. Finally, our proposed method yields impressive results on the ADE20K, Cityscapes, and Pascal Context datasets.

1. INTRODUCTION

Semantic segmentation plays an important role in practical applications such as autonomous driving, image editing, etc. Numerous CNN-based methods (Chen et al., 2014; Fu et al., 2019; Ding et al., 2019) have been proposed; they attempt to adjust the image feature representations of the model itself so that each pixel is recognized correctly. However, almost all of these methods neglect the particularity of the class weights (in the classification layer), which play an important role in distinguishing pixel categories in segmentation models. Hence, it is critical to keep class weights discriminative. Unfortunately, CNN models have a natural defect in this respect. Generally speaking, the most discriminative higher layers of a CNN have the largest receptive fields, so pixels around a boundary may obtain confusing features from both sides. As a result, these ambiguous boundary pixels mislead the optimization direction of the model and make the class weights of categories that share many adjacent pixels indistinguishable. For convenience of illustration, we call this issue Boundary-caused Class Weights Confusion (BCWC). We take DeeplabV3+ (Chen et al., 2018a) as an example and train it on the ADE20K (Zhou et al., 2017) dataset. We then count the number of adjacent pixels for each class pair and, for each class, find the corresponding category with which it shares the most adjacent pixels. Fig 1(a) shows the similarity of the class weights of these pairs in descending order of the number of adjacent pixels. It is clear that if two categories share more adjacent pixels, their class weights tend to be more similar, which indicates that BCWC makes class representations lack discrimination and damages the overall segmentation performance.

Considering the inherent drawback of CNN networks mentioned above, delving into the relationship between raw pixels becomes a potential alternative for eliminating the BCWC problem, and the Conditional Random Field (CRF) (Chen et al., 2014) stands out. It is generally known that pixels of the same object tend to share similar characteristics in a local area. Intuitively, the CRF utilizes this local consistency between original image pixels to refine boundary segmentation results with the help of inner pixels of the same object, making some boundary pixels that are misclassified by the CNN quite easy to recognize correctly. However, these CRF-based methods (Chen et al., 2014; Zhen et al., 2020a) only adopt the CRF as an offline post-processing module, which we call Vanilla-CRF, to refine the final segmentation results. They are incapable of relieving the BCWC problem, as the CRF and the CNN network are treated as two totally separate modules.

Based on Chen et al. (2014; 2017a), Lin et al. (2015); Arnab et al. (2016); Zheng et al. (2015) go a step further and unify the segmentation model and the CRF in a single pipeline for end-to-end training; we call this Joint-CRF for simplicity. Like Vanilla-CRF, Joint-CRF rectifies misclassified boundary pixels by increasing the prediction score of the associated category, which means it still operates on object class probabilities. But it can alleviate the BCWC problem to some extent, as the probability scores refined by the CRF are directly involved in backpropagation; the disturbing gradients caused by those pixels are then relieved, which promotes class representation learning. However, as shown in Fig 3, the effectiveness of Joint-CRF is restricted: due to its defective design, it only optimizes the scale of the gradient and lacks the ability to optimize class representations effectively. More theoretical analysis can be found in Sec. 3.3.

To overcome the aforementioned drawbacks, in this paper we present a novel approach named Embedded CRF (E-CRF) to address the BCWC problem more effectively. The superiority of E-CRF lies in two main aspects. On the one hand, by fusing the CRF mechanism into the segmentation model, E-CRF utilizes the local consistency among original image pixels to guide the message passing of high-level features. Pixel pairs that come from the same object tend to obtain higher message passing weights, so the feature representations of boundary pixels can be purified by the corresponding inner pixels of the same object. In turn, those pixels further contribute to discriminative class representation learning.
On the other hand, it extends the fashion of optimizing class weights from one perspective (i.e., scale) to two (i.e., scale and direction) during backpropagation.¹

¹ These methods improve boundary segmentation and may have an effect on class weights, but they are not explicit and lack theoretical analysis. We show the great benefit of explicitly considering the BCWC issue; see A.4.1.
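To make the Fig 1(a) measurement concrete, below is a minimal sketch of how such a statistic can be computed. It is our reconstruction under stated assumptions, not the authors' released code: the helper names, the ignore-label handling, the use of cosine similarity over the rows of the classification layer's weight matrix, and the model.classifier.weight accessor are all hypothetical.

# Minimal sketch (our reconstruction, not the authors' released code) of the
# Fig 1(a) statistic: count adjacent pixels for every class pair on the
# ground-truth label maps, then compare the cosine similarity of the class
# weights in the final classification layer.
import numpy as np
import torch
import torch.nn.functional as F


def count_adjacent_pairs(label_map: np.ndarray, num_classes: int) -> np.ndarray:
    # Count horizontally and vertically adjacent pixel pairs whose labels differ.
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    neighbor_pairs = [
        (label_map[:, :-1].ravel(), label_map[:, 1:].ravel()),  # horizontal
        (label_map[:-1, :].ravel(), label_map[1:, :].ravel()),  # vertical
    ]
    for p, q in neighbor_pairs:
        # Keep only true class boundaries; drop ignore labels (e.g. 255 on ADE20K).
        mask = (p != q) & (p < num_classes) & (q < num_classes)
        np.add.at(counts, (p[mask], q[mask]), 1)
    return counts + counts.T  # symmetric: pair order does not matter


def class_weight_similarity(classifier_weight: torch.Tensor) -> torch.Tensor:
    # Pairwise cosine similarity between the rows of the (C, D) classifier weight.
    w = F.normalize(classifier_weight, dim=1)
    return w @ w.t()


# Usage (hypothetical accessor): accumulate counts over all training label maps,
# then for each class c read off sim[c, counts[c].argmax()] and sort classes by
# counts[c].max() to reproduce a curve like Fig 1(a).
# sim = class_weight_similarity(model.classifier.weight.detach())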



Figure 1: (a) Observations on ADE20K. For each class, we find the corresponding category that shares the most adjacent pixels with it and calculate the similarity of their class weights. The X-axis stands for class pairs in descending order of the number of adjacent pixels, and the Y-axis represents the similarity of their class weights. The blue line denotes the baseline model, while the orange line denotes E-CRF. Apparently, two categories that share more adjacent pixels are inclined to have more similar class weights, while E-CRF effectively decreases the similarity between adjacent categories and makes their class weights more discriminative. (b) Message passing procedure of E-CRF. F is the original feature map of the CNN network. E-CRF applies the pairwise module ψ_f^p and the auxiliary superpixel-based module ψ_f^s to F to obtain refined feature maps F_p and F_s, respectively. Then F, F_p, and F_s are fused as F* to further segment the image.
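To make the Fig 1(b) data flow concrete, here is a hedged PyTorch sketch of one plausible instantiation. The Gaussian affinity on raw RGB values for ψ_f^p, the per-superpixel average pooling for ψ_f^s, and the plain additive fusion F* = F + F_p + F_s are our assumptions read off the caption; the paper's actual modules are defined in its method section (Sec. 3).

# Hedged sketch of the Fig 1(b) message passing. The Gaussian RGB affinity,
# the per-superpixel mean pooling, and the additive fusion are assumptions
# drawn from the caption, not the paper's exact formulation.
import torch


def pairwise_message_passing(feat, rgb, sigma=0.1):
    # psi_f^p: reweight high-level features by low-level RGB affinity.
    # feat: (B, C, H, W) high-level feature map F; rgb: (B, 3, H, W) image
    # resized to the same spatial size. Pixels with similar raw colors
    # (likely the same object) exchange stronger messages, so boundary
    # pixels are purified by inner pixels of the same object.
    B, C, H, W = feat.shape
    x = rgb.flatten(2).transpose(1, 2)                 # (B, HW, 3)
    d2 = torch.cdist(x, x) ** 2                        # (B, HW, HW)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)  # row-normalized weights
    # NOTE: the dense HW x HW affinity is for clarity only; practical CRF
    # implementations restrict messages to a neighborhood or use filtering.
    f = feat.flatten(2).transpose(1, 2)                # (B, HW, C)
    return (w @ f).transpose(1, 2).reshape(B, C, H, W)  # F_p


def superpixel_message_passing(feat, sp):
    # psi_f^s: replace each pixel's feature with the mean feature of its
    # superpixel (sp: (B, H, W) long ids, e.g. from skimage SLIC), exploiting
    # the local object prior for more reliable message passing.
    B, C, H, W = feat.shape
    out = torch.empty_like(feat)
    for b in range(B):
        ids = sp[b].reshape(-1)                        # (HW,)
        f = feat[b].reshape(C, -1)                     # (C, HW)
        k = int(ids.max()) + 1
        sums = torch.zeros(C, k, device=feat.device).index_add_(1, ids, f)
        cnts = torch.zeros(k, device=feat.device).index_add_(
            0, ids, torch.ones_like(ids, dtype=feat.dtype))
        out[b] = (sums / cnts.clamp(min=1))[:, ids].reshape(C, H, W)
    return out                                         # F_s


def ecrf_fuse(feat, rgb, sp):
    # Fuse F, F_p, and F_s into F*; a plain sum is our assumption here.
    return feat + pairwise_message_passing(feat, rgb) + superpixel_message_passing(feat, sp)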

