CONTEXTUAL IMAGE PARSING VIA PANOPTIC SEGMENT SORTING

Anonymous

Abstract

Visual context is versatile and hard to describe or label precisely. We aim to leverage the densely labeled task of image parsing, a.k.a. panoptic segmentation, to learn a model that encodes and discovers object-centric context. Most existing deep learning approaches tackle image parsing by fusing pixel-wise classification with instance masks produced by two sub-networks. Such approaches isolate things from stuff and fuse the semantic and instance masks at a later stage. To encode object-centric context inherently, we propose a metric learning framework, Panoptic Segment Sorting, that is trained directly on stuff and things jointly. Our key insight is to make the panoptic embeddings separate every instance so that the model automatically learns to leverage visual context, since many instances across different images appear similar. We show that the context of our model's retrieved instances is relatively 13.7% more consistent, further demonstrating its ability to discover novel context without supervision. Our overall framework also achieves competitive performance on standard panoptic segmentation metrics among state-of-the-art methods on two large datasets, Cityscapes and PASCAL VOC. These promising results suggest that pixel-wise embeddings can not only inject new understanding into panoptic segmentation but also potentially serve other tasks such as modeling instance relationships.

1. INTRODUCTION

Visual context is versatile and hard to describe or label precisely, yet it is critical for humans (Medin & Schaffer, 1978) to recognize objects quickly. More importantly, objects in different contexts carry different meanings. For example, to a driver, pedestrians walking in crosswalks should receive more attention than those on sidewalks. However, it is almost impossible to categorize objects with different contexts, as the change can be subtle yet dramatic: a pedestrian is more likely in danger if walking in front of a car than by a car, even though in both cases a person and a car appear together. We are thus motivated to propose a model that automatically encodes and discovers object visual context by leveraging a densely labeled task, panoptic segmentation. Panoptic segmentation (Kirillov et al., 2019b), a.k.a. image parsing (Tu et al., 2005), is to segment an image into its constituent visual patterns with both semantic and instance labels. The major challenge lies in delineating different instances while associating them with semantic categories. For example, one has to segment two side-by-side cars apart while still classifying them as the same category. Most existing approaches (Kirillov et al., 2019a; Xiong et al., 2019; Yang et al., 2019; Cheng et al., 2020; Li et al., 2020) tackle these two aspects via two sub-networks, an instance segmentation branch and a semantic segmentation branch. The advantage of such approaches is that each branch can cater to one aspect and achieve high performance. Additional modules for integrating things and stuff are then needed to resolve disagreements between the two branches. Yet for object visual context, things and stuff are two integral parts. Hence, we aim to propose a framework that unifies these two seemingly competing aspects and thus encodes visual context inherently.
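To make the contrast with the two-branch baselines concrete, the sketch below illustrates the late-fusion step those approaches rely on (not our method, which avoids fusion by learning a unified embedding). It is a minimal, hypothetical heuristic: confident "thing" instances are painted first, and leftover pixels are filled with "stuff" classes from the semantic branch. The function name, threshold, and data layout are our own illustrative choices, not taken from any cited system.

```python
import numpy as np

def fuse_panoptic(semantic, instances, stuff_ids, overlap_thresh=0.5):
    """Toy late fusion of semantic and instance predictions.

    semantic:  (H, W) int array of per-pixel class ids from the semantic branch.
    instances: list of (mask, score, class_id) from the instance branch,
               where mask is an (H, W) boolean array.
    stuff_ids: iterable of class ids treated as uncountable "stuff".
    Returns an (H, W) array of unique segment ids and an id -> class dict.
    """
    panoptic = np.zeros(semantic.shape, dtype=np.int32)  # 0 = unassigned
    seg_class = {}
    next_id = 1
    # Paint "thing" instances from most to least confident; drop an instance
    # if too much of its mask is already claimed by a higher-scoring one.
    for mask, score, cls in sorted(instances, key=lambda t: -t[1]):
        free = mask & (panoptic == 0)
        if mask.sum() == 0 or free.sum() / mask.sum() < overlap_thresh:
            continue
        panoptic[free] = next_id
        seg_class[next_id] = cls
        next_id += 1
    # Fill the remaining pixels with "stuff" regions from the semantic branch.
    for cls in stuff_ids:
        region = (semantic == cls) & (panoptic == 0)
        if region.any():
            panoptic[region] = next_id
            seg_class[next_id] = cls
            next_id += 1
    return panoptic, seg_class
```

The disagreements mentioned above arise exactly here: wherever an instance mask and the semantic map conflict, one branch's prediction silently overrides the other's, which is what motivates treating things and stuff jointly instead.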
Our framework is inspired by the perceptual organization view (Biederman, 1987): humans perceive a scene by breaking it down into visual groups and structures; repeated structures are then associated for cognitive recognition. Our key insight is to separate everything first and group visually similar components later. The grouping takes place within an image and across images. Within an image,

