CONTEXTUAL IMAGE PARSING VIA PANOPTIC SEGMENT SORTING

Anonymous

Abstract

Visual context is versatile and hard to describe or label precisely. We aim to leverage the densely labeled task of image parsing, a.k.a. panoptic segmentation, to learn a model that encodes and discovers object-centric context. Most existing approaches based on deep learning tackle image parsing by fusing pixel-wise classification and instance masks from two sub-networks. Such approaches isolate things from stuff and fuse the semantic and instance masks at a later stage. To encode object-centric context inherently, we propose a metric learning framework, Panoptic Segment Sorting, that is trained directly on stuff and things jointly. Our key insight is to make the panoptic embeddings separate every instance, so that the model automatically learns to leverage visual context, as many instances across different images appear similar. We show that the context of our model's retrieved instances is relatively more consistent by 13.7%, further demonstrating its ability to discover novel context without supervision. Our overall framework also achieves competitive performance on standard panoptic segmentation metrics among state-of-the-art methods on two large datasets, Cityscapes and PASCAL VOC. These promising results suggest that pixel-wise embeddings can not only inject new understanding into panoptic segmentation but potentially serve other tasks such as modeling instance relationships.

1. INTRODUCTION

Visual context is versatile and hard to describe or label precisely, yet it is critical for humans (Medin & Schaffer, 1978) to recognize objects quickly. More importantly, objects in different contexts carry different meanings. For example, to a driver, pedestrians walking in crosswalks should receive more attention than pedestrians on sidewalks. However, it is almost impossible to categorize objects by their contexts, as the change can be subtle yet dramatic: a pedestrian is more likely in danger when walking in front of a car than beside one, even though both scenes contain a person and a car together. We are thus motivated to propose a model that automatically encodes and discovers object visual context by leveraging a densely labeled task, panoptic segmentation. Panoptic segmentation (Kirillov et al., 2019b), a.k.a. image parsing (Tu et al., 2005), is the task of segmenting an image into its constituent visual patterns with both semantic and instance labels. The major challenge lies in delineating different instances while associating them with semantic categories. For example, one has to segment two side-by-side cars apart while still classifying them as the same category. Most existing approaches (Kirillov et al., 2019a; Xiong et al., 2019; Yang et al., 2019; Cheng et al., 2020; Li et al., 2020) tackle these two aspects via two sub-networks, an instance segmentation branch and a semantic segmentation branch. The advantage of such approaches is that each branch can cater to one aspect and achieve high performance, but additional modules are needed to integrate things and stuff and to resolve disagreements between the two branches. Yet for object visual context, things and stuff are two integral parts. Hence, we aim to propose a framework that unifies these two seemingly competing aspects and thus encodes visual context inherently.
Our framework is inspired by the perceptual organization view (Biederman, 1987): Humans perceive a scene by breaking it down into visual groups and structures; repeated structures are then associated for cognitive recognition. Our key insight is to separate everything first and group visually similar components later. The grouping takes place both within an image and across images. Within an image, visually similar segments are merged to form instances; across images, visually familiar segments are associated to create semantics, as illustrated in Fig. 1. We carry out this idea by building an end-to-end trained pixel-wise embedding framework. Each pixel in an image is mapped via a CNN to a feature in a latent space, and nearby features indicate pixels belonging to the same instance. This framework is therefore a non-parametric model at the segment and instance levels, as its complexity scales with the number of segments and instances, i.e., exemplars. In particular, by forcing all the instances to separate, the model has to utilize all the available visual and semantic information. The model thus learns to separate instances by not only their appearances but also their surroundings, or visual context. A major difference between our model and others is the metric learning perspective: our model trains with a contrastive loss that captures pixel-to-segment relationships, while others train with pixel-wise classification that predicts a category or instance directly. As a result, the learned panoptic embeddings can discover instances under similar context, as shown in the bottom row of Figure 1. Specifically, we adapt the Segment Sorting approach (Hwang et al., 2019b) to panoptic segmentation by sorting segments according to both their semantic and instance labels, hence dubbed Panoptic Segment Sorting (PSS). Pixel-wise embeddings trained this way thus encode both semantic and instance information.
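To make the pixel-to-segment contrastive idea concrete, the following is a minimal numpy sketch. It is our own illustration, not the authors' exact loss: the function names (`segment_prototypes`, `pixel_to_segment_loss`), the use of cosine similarity, and the temperature value are all assumptions. Each pixel embedding is attracted to the prototype (mean feature) of its own segment and repelled from the prototypes of all other segments via a softmax cross-entropy.

```python
import numpy as np

def segment_prototypes(embeddings, segment_ids):
    """Average pixel embeddings within each segment to form an
    L2-normalized prototype feature per segment. (Hypothetical helper.)"""
    protos = {}
    for s in np.unique(segment_ids):
        p = embeddings[segment_ids == s].mean(axis=0)
        protos[s] = p / (np.linalg.norm(p) + 1e-8)
    return protos

def pixel_to_segment_loss(embeddings, segment_ids, temperature=0.1):
    """Contrastive loss sketch: each pixel is pulled toward its own
    segment's prototype and pushed away from all other prototypes."""
    protos = segment_prototypes(embeddings, segment_ids)
    keys = sorted(protos)
    proto_mat = np.stack([protos[k] for k in keys])            # (S, D)
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    logits = emb @ proto_mat.T / temperature                   # (N, S)
    targets = np.array([keys.index(s) for s in segment_ids])   # (N,)
    # numerically stable softmax cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With well-separated segments the loss is near zero, while mixing pixels from different segments under one label drives it up, which is the separation pressure the paragraph above describes.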
We then predict each segment's semantic label by simply mapping and classifying its prototype feature with a softmax classifier. We also propose a corresponding clustering algorithm that merges segments into instances with a nearest-neighbor criterion (Sarfraz et al., 2019). To alleviate the problem of instances at various scales, we further equip our framework with hybrid-scale exemplars during training and dynamic partitioning during inference. Finally, we facilitate the merging process with a seeding branch that predicts the center of each instance. As a result, we demonstrate that the contexts of instances retrieved by our panoptic embeddings are relatively more consistent by 13.7%, while achieving competitive performance among the state-of-the-art on two datasets, Cityscapes (Cordts et al., 2016) and PASCAL VOC (Everingham et al.). These promising results suggest that Panoptic Segment Sorting, or pixel-wise embeddings in general, can not only inject new understanding into panoptic segmentation but potentially serve as a foundation for other tasks such as discovering novel contexts or modeling instance relationships.
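The nearest-neighbor merging step can be sketched as follows. This is a simplified first-neighbor clustering in the spirit of Sarfraz et al. (2019), not the paper's exact algorithm; the function name `merge_segments_first_neighbor` and the use of cosine similarity over prototype features are our assumptions. Each segment is linked to its single nearest neighbor in the embedding space, and connected components of these links become instances.

```python
import numpy as np

def merge_segments_first_neighbor(prototypes):
    """Merge segments into instances: link every segment to its first
    nearest neighbor (cosine similarity) and return the connected
    components as consecutive instance ids. (Illustrative sketch.)"""
    n = len(prototypes)
    normed = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)        # a segment is not its own neighbor
    nn = sim.argmax(axis=1)               # first nearest neighbor of each segment

    # union-find over the (i, nn[i]) links
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        parent[find(i)] = find(int(nn[i]))

    roots = [find(i) for i in range(n)]
    # relabel components as consecutive instance ids, in order of appearance
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```

In the full pipeline, the seeding branch described above would supply instance centers to guide this merging; here the sketch shows only the plain first-neighbor criterion.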

2. RELATED WORK

Image parsing and panoptic segmentation. The task of image parsing is first introduced in Tu et al. (2005), where they formulate the solution in a Bayesian framework and construct a parsing



Figure 1: Top row, from left to right: input image, panoptic embeddings, panoptic predictions, and panoptic labels. We overlay the panoptic embeddings with the resultant over-segmentation boundaries. Middle row: After extracting panoptic embeddings from a CNN and the resultant over-segmentation, we use the segment prototype features to find nearest neighbors of each query segment (in red), within the image (middle) or across images (right). These retrieval results probe what is learned in the embedding space. Bottom row: an example of context-specific instance retrieval results, where pedestrians crossing an intersection are discovered without supervision.

