MEDOE: A MULTI-EXPERT DECODER AND OUTPUT ENSEMBLE FRAMEWORK FOR LONG-TAILED SEMANTIC SEGMENTATION

Abstract

The long-tailed distribution of semantic categories, often ignored by conventional methods, causes unsatisfactory performance on tail categories in semantic segmentation. In this paper, we focus on the problem of long-tailed semantic segmentation. Although long-tailed recognition methods (e.g., re-sampling and re-weighting) have been proposed for other tasks, they are likely to compromise the contextual information that is crucial in semantic segmentation, and are therefore hardly adaptable to long-tailed semantic segmentation. To address this problem, we propose a novel method, named MEDOE, which ensembles and groups contextual information. Specifically, MEDOE is a two-stage framework comprising a multi-expert decoder (MED) and a multi-expert output ensemble (MOE). The MED includes several "experts", each of which takes as input the dataset masked according to specific categories based on the frequency distribution and self-adaptively generates contextual information for classification. The MOE then ensembles the experts' outputs with learnable decision weights. As a model-agnostic framework, MEDOE can be flexibly and efficiently coupled with various popular deep neural networks (e.g., DeepLabv3+, OCRNet, and PSPNet) to improve their performance in long-tailed semantic segmentation. Experimental results show that the proposed framework outperforms current methods on both the Cityscapes and ADE20K datasets by up to 2% in mIoU and 6% in mAcc.

1. INTRODUCTION

Semantic segmentation is the task of predicting the semantic category of each pixel in an image. As a fundamental computer vision task, it is of great importance in various real-world applications (e.g., clinical analysis and autonomous driving). Conventional methods follow a backbone-context-module architecture, where the context module contributes significantly and can be modified to enhance the extraction and aggregation of surrounding-pixel and global information through large-receptive-field convolutions or attention mechanisms (Chen et al., 2018; He et al., 2019; Huang et al., 2019; Yuan et al., 2019; Zhao et al., 2017). Such modifications have enabled recent state-of-the-art methods to improve semantic segmentation performance on various benchmark datasets (Caesar et al., 2018; Ding et al., 2019; Zhu et al., 2019). Despite their overall impressive performance, the above methods still face challenges from the perspective of data distribution. For example, Table 1 shows the results for head, body, and tail categories (i.e., categories ranked from top to bottom by pixel frequency) on benchmark datasets using DeepLabv3+ (Chen et al., 2018), where a positive correlation can be found between results and data distribution. In other words, performance declines on tail categories. This suggests the problem of long-tailed distribution in semantic segmentation: a few head categories dominate the majority of pixels, whereas many tail categories correspond to significantly fewer pixels.

Figure 1: Example of semantic segmentation on Cityscapes, where tail categories (e.g., "pole" and "wall") are not well segmented. From left to right: street scene image, ground truth, and segmentation map using DeepLabv3+.
Specifically, when all categories are processed in the same way under a long-tailed distribution, head categories may excessively influence training and hinder the learning of contextual information about tail ones, leading to unsatisfactory pixel-level classification. Solving the problem of long-tailed semantic segmentation is critical in real-world scenarios: as shown in Figure 1, distinguishing poles in street-scene images may prevent potential accidents. To the best of our knowledge, this paper is the first to explicitly focus on long-tailed semantic segmentation, and it aims to provide a straightforward solution by extending existing recognition methods, which adopt various strategies to address long-tailed distributions in image classification (Buda et al., 2018; Huang et al., 2016; Gupta et al., 2019). These methods can be broadly categorized into re-weighting, re-sampling, and ensembling-and-grouping. Re-weighting methods increase the weights of tail categories and decrease those of head ones (Cao et al., 2019b; Liu et al., 2019), following the assumption that images are nearly independent and identically distributed (i.i.d.); under this assumption, the classification accuracy for each category depends on the frequency of the corresponding images (Cao et al., 2019b; Cui et al., 2022). Re-sampling methods under-sample the head categories and over-sample, or even augment, the tail ones (He & Garcia, 2009; Kim et al., 2020; Chu et al., 2020), usually following a random sampling strategy to ensure fairness. Ensembling-and-grouping methods first train a feature extractor on the whole imbalanced dataset as representation learning, then adjust the classifier margins using multi-expert frameworks (Xiang et al., 2020; Zhang et al., 2021; Zhou et al., 2020) for re-balancing (Kang et al., 2019).
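To make the re-weighting idea concrete, the following is a minimal sketch (not any cited method's actual implementation) of pixel-wise cross-entropy with inverse-frequency class weights; the `pixel_counts` values and the specific weighting rule `w_c ∝ 1/n_c` are illustrative assumptions:

```python
import numpy as np

def reweighted_ce(logits, labels, pixel_counts):
    """Pixel-wise cross-entropy with inverse-frequency class weights.

    logits: (N, C) per-pixel class scores; labels: (N,) int class ids;
    pixel_counts: per-class pixel frequencies on the training set.
    """
    counts = np.asarray(pixel_counts, dtype=float)
    # Inverse-frequency weighting: rare (tail) classes get larger weights.
    w = counts.sum() / (len(counts) * counts)
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true class for each pixel.
    nll = -np.log(p[np.arange(len(labels)), labels])
    return float((w[labels] * nll).mean())
```

Under this scheme, a misclassified tail pixel contributes far more to the loss than a head pixel, which is exactly the see-saw risk discussed next.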
Nevertheless, the aforementioned methods for long-tailed image classification can hardly be adapted to long-tailed semantic segmentation. Specifically, re-weighting methods are ill-suited to pixel-level classification because a pixel is usually highly correlated with its surrounding pixels (i.e., not i.i.d.) through the contextual information in the image (Lin et al., 2017). They may also cause a see-saw phenomenon: accuracy on head categories is compromised while tail categories are deliberately emphasized. Re-sampling methods, being based on random sampling, produce a large number of independent pixels that undermine the contextual information of an image and can be detrimental to semantic segmentation. Ensembling-and-grouping methods rely heavily on re-balancing, whose classifier re-adjustment is not well adapted to semantic segmentation because it ignores the differences in contextual information among head, body, and tail categories.

Motivated by this observation about ensembling-and-grouping methods, we propose MEDOE, a two-stage multi-expert decoder and output ensemble framework for long-tailed semantic segmentation. At Stage 1, the feature map extracted by a backbone trained on the whole imbalanced dataset, which represents elementary knowledge, is first passed to a multi-expert decoder (MED). Each expert (i.e., a pair of a context module and a classifier head) in the MED works on a masked copy of the dataset in which, for each image, the pixels and corresponding labels of categories outside the expert's assigned group (e.g., head categories for the body- and tail-category experts) are masked. Together with other constraints, this expert-specific pixel-masking strategy enables each expert to reduce the influence of head categories and irrelevant pixels and to acquire the contextual information of body or tail categories. Stage 2 then deploys a multi-expert output ensemble (MOE) that combines the outputs of all experts from Stage 1 using decision weights.
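The expert-specific pixel-masking step can be sketched as follows; this is a hypothetical illustration, assuming the common segmentation convention of an `ignore_index` of 255 for pixels excluded from the loss, not the paper's actual preprocessing code:

```python
import numpy as np

IGNORE_INDEX = 255  # conventional "ignore" label in segmentation datasets

def mask_for_expert(label_map, expert_categories):
    """Keep only the labels an expert is responsible for; mask the rest.

    label_map: (H, W) int array of ground-truth category ids.
    expert_categories: the category ids assigned to this expert.
    Pixels of all other categories are set to IGNORE_INDEX so they
    contribute no supervision signal when training this expert.
    """
    masked = np.full_like(label_map, IGNORE_INDEX)
    for c in expert_categories:
        masked[label_map == c] = c
    return masked
```

A tail-category expert trained against such masked labels is never penalized on head-category pixels, so its context module can specialize on the rare classes.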
Instead of being user-specified, these weights are learned by a decision-maker to avoid the negative impact of the constraints. The proposed framework is model-agnostic and can be integrated with popular semantic segmentation methods such as DeepLabv3+ (Chen et al., 2018), PSPNet (Zhao et al., 2017), and OCRNet (Yuan et al., 2019).
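As a minimal sketch of the output-ensemble idea (the tensor layout and softmax normalization of the decision weights are illustrative assumptions, not the paper's exact MOE design):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_outputs(expert_logits, decision_weights):
    """Combine per-expert logits with learned decision weights.

    expert_logits: (E, H, W, C) logits from E experts.
    decision_weights: (E,) raw scores produced by the decision-maker,
    normalized with a softmax before mixing.
    Returns: (H, W, C) ensembled logits.
    """
    w = softmax(np.asarray(decision_weights, dtype=float))
    # Weighted sum over the expert axis.
    return np.tensordot(w, np.asarray(expert_logits), axes=1)
```

In training, `decision_weights` would be trainable parameters (or the output of a small network), so the ensemble learns how much to trust each expert rather than relying on hand-tuned coefficients.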





