REINFORCEMENT LEARNING FOR INSTANCE SEGMENTATION WITH HIGH-LEVEL PRIORS

Anonymous

Abstract

Instance segmentation is a fundamental computer vision problem that remains challenging despite the impressive recent advances of deep learning-based methods. Given sufficient training data, fully supervised methods can yield excellent performance, but ground-truth annotation remains a major bottleneck, especially for biomedical applications where it has to be performed by domain experts. The number of labels required can be drastically reduced by using rules derived from prior knowledge to guide the segmentation. However, these rules are in general not differentiable and thus cannot be used with existing methods. Here, we lift this requirement by using stateless actor-critic reinforcement learning, which permits non-differentiable rewards. We formulate the instance segmentation problem as graph partitioning, and the actor-critic predicts the edge weights driven by the rewards, which are based on the conformity of segmented instances to high-level priors on object shape, position or size. Experiments on toy and real data demonstrate that a good set of priors is sufficient to reach excellent performance without any direct object-level supervision.

1. INTRODUCTION

Instance segmentation is the task of segmenting all objects in an image and assigning each of them a different id. It is the necessary first step for analyzing individual objects in a scene and is thus of paramount importance in many computer vision applications. In recent years, fully supervised instance segmentation methods have made tremendous progress both in natural image applications and in scientific imaging, achieving excellent segmentations for very difficult tasks (Chen, Wang, and Qiao 2021; Lee et al. 2017). A large corpus of training images is hard to avoid when the segmentation method needs to take into account the full variability of the natural world. However, in many practical segmentation tasks the appearance of the objects can be expected to conform to certain rules that are known a priori. Examples include surveillance, industrial quality control and especially medical and biological imaging applications, where full exploitation of such prior knowledge is particularly important as the training data is sparse and difficult to acquire: pixelwise annotation of the necessary instance-level ground truth for a microscopy experiment can take weeks or even months of expert time. The use of shape priors has a long history in this domain (Delgado-Gonzalo et al. 2014; Osher and Paragios 2007), but the most powerful learned shape models still require ground truth (Oktay et al. 2018), and generic shapes are hard to combine with CNN losses and other, non-shape, priors. For many high-level priors it has already been demonstrated that integration of the prior directly into the CNN loss can lead to superior segmentations while significantly reducing the necessary amount of training data (Kervadec et al. 2019). However, the requirement of formulating the prior as a differentiable function poses a severe limitation on the kinds of high-level knowledge that can be exploited with such an approach.
Our contribution addresses this limitation and establishes a framework in which a rich set of non-differentiable rules and expectations can be used to steer the network training. To circumvent the requirement of a differentiable loss function, we turn to the reinforcement learning paradigm, where the rewards can be computed from a non-differentiable cost function. We base our framework on a stateless actor-critic setup (Pfau and Vinyals 2016), providing one of the first practical applications of this important theoretical construct. In more detail, we solve the instance segmentation problem as agglomeration of image superpixels, with the agent predicting the weights of the edges in the superpixel region adjacency graph. Based on the predicted weights, the segmentation is obtained through (non-differentiable) graph partitioning. The segmented objects are evaluated by the critic, which learns to approximate the rewards based on object- and image-level reasoning (see Fig. 1). The main contributions of this work can be summarized as follows: (i) we formulate instance segmentation as an RL problem based on a stateless actor-critic setup, encapsulating the non-differentiable step of instance extraction into the environment and thus achieving end-to-end learning; (ii) we do not use annotated images for supervision and instead exploit prior knowledge of instance appearance and morphology, tying the rewards to the conformity of the predicted objects to pre-defined rules and learning to approximate the (non-differentiable) reward function with the critic; (iii) we introduce a strategy for spatial decomposition of rewards based on fixed-size subgraphs to enable localized supervision from combinations of object- and image-level rules; (iv) we demonstrate the feasibility of our approach on synthetic and real images and show an application to two important segmentation tasks in biology.
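To make the pipeline concrete, the following self-contained Python sketch mimics the two non-differentiable steps described above: partitioning a superpixel region adjacency graph from signed edge weights (here hard-coded, standing in for the actor's predictions) and scoring the resulting instances against a simple size prior. All names, the thresholding-based partitioner and the reward rule are illustrative simplifications, not the actual implementation used in this work.

```python
# Illustrative sketch (not the paper's implementation): instance
# segmentation as partitioning of a superpixel region adjacency graph.
from collections import defaultdict

def partition(num_nodes, edges, weights):
    """Merge nodes joined by positively weighted edges via union-find,
    a crude stand-in for the non-differentiable partitioning step."""
    parent = list(range(num_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (u, v), w in zip(edges, weights):
        if w > 0:  # positive weight = attraction, merge the clusters
            parent[find(u)] = find(v)
    return [find(i) for i in range(num_nodes)]

def size_prior_reward(labels, superpixel_sizes, expected_size):
    """Per-object reward: +1 if an instance's total size is within 50%
    of the prior expectation, -1 otherwise. A purely illustrative rule;
    shape or position priors would be scored analogously."""
    instance_size = defaultdict(int)
    for lbl, sz in zip(labels, superpixel_sizes):
        instance_size[lbl] += sz
    return {lbl: 1.0 if abs(sz - expected_size) <= 0.5 * expected_size
            else -1.0
            for lbl, sz in instance_size.items()}

# Four superpixels on a chain graph; the "actor" output attracts
# pairs 0-1 and 2-3 and repels the pair 1-2.
edges = [(0, 1), (1, 2), (2, 3)]
weights = [0.8, -0.5, 0.9]             # signed edge predictions
labels = partition(4, edges, weights)  # two instances: {0,1}, {2,3}
rewards = size_prior_reward(labels, superpixel_sizes=[10, 12, 9, 11],
                            expected_size=20)
```

In the actual framework the per-object rewards are additionally decomposed over fixed-size subgraphs so that the critic receives localized feedback; here the reward dictionary simply maps each instance label to its score.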
In all experiments, our framework delivers excellent segmentations with no supervision other than high-level rules.

2. RELATED WORK

Reinforcement learning has so far not found significant adoption in the segmentation domain. The closest to our work are two methods in which RL has been introduced to learn a sequence of segmentation decision steps as a Markov Decision Process. In the actor-critic framework of Araslanov, Rothkopf, and Roth 2019, the actor recurrently predicts one instance mask at a time based on the gradient provided by the critic. The training needs fully segmented images as supervision, and the overall system, including an LSTM sub-network between the encoder and the decoder, is fairly complex. In Jain et al. 2011, the individual decision steps correspond to merges of clusters, while their sequence defines a hierarchical agglomeration process on a superpixel graph. The reward function is based on the Rand index and thus not differentiable, but the overall framework requires full (super)pixelwise supervision for training. Reward decomposition was introduced for multi-agent RL by Sunehag et al. 2017, where a global reward is decomposed into a per-agent reward. Bagnell and Ng 2006 prove that a stateless RL setup with decomposed rewards requires far fewer training samples than an RL setup with a global reward. In Xu et al. 2019, reward decomposition is applied both temporally and spatially for zero-shot inference on unseen environments by training on locally selected samples to learn the underlying physics of the environment. The restriction to differentiable losses is present in all application domains of deep learning. Common ways to address it are based on a soft relaxation of the loss that can be differentiated. The relaxation can be designed specifically for the loss, for example, Area-under-Curve (Eban et al. 2017) for classification or the Jaccard index (Berman, Triki, and Blaschko 2018) for semantic segmentation.
These approaches are not directly applicable to our use case, as we aim to use a variety of object- and image-level priors, which should be combined without handcrafting an approximate loss for each case. More generally, but still for a concrete task loss, Direct Loss Minimization has been proposed in Y. Song et al. 2016. For semi-supervised learning of a classification or ranking task, Discriminative Adversarial Networks have been proposed as a means to learn an approximation to the loss (Santos, Wadhawan, and Zhou 2017). Most generally, Grabocka, Scholz, and Schmidt-Thieme 2019 propose to train a surrogate neural network which serves as a smooth approximation of the true loss. In our setup, the critic can informally be viewed as a surrogate network, as it learns to approximate the priors through the rewards by Q-learning. Incorporation of rules and priors is particularly important in biomedical imaging applications, where such knowledge can be exploited to augment or even substitute scarce ground-truth annotations. For example, the shape prior is explicitly encoded in popular nuclear (Schmidt et al. 2018) and cellular (Stringer et al. 2021) segmentation algorithms based on spatial embedding learning. Learned non-linear representations of shape are used in Oktay et al. 2018, while in Hu et al. 2019 the loss for object boundary prediction is made topology-aware. Domain-specific priors can also be exploited in post-processing by graph partitioning (Pape et al. 2019). Interestingly, the energy minimization procedure underlying the graph partitioning can also be incorporated into the learning step (Abbas and Swoboda 2021; Maitin-Shepard et al. 2016; J. Song et al. 2019).

