CAST: CONCURRENT RECOGNITION AND SEGMENTATION WITH ADAPTIVE SEGMENT TOKENS

Abstract

Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition together with hierarchical image segmentation based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at the superpixel, segment, and full-image levels, enforcing consistency across induced segmentations while maximizing discrimination among image instances. Our model innovates on vision transformers in three aspects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination. Our work delivers the first concurrent recognition and hierarchical segmentation model learned without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.

1. INTRODUCTION

Convolutional neural networks (CNNs) (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016) and Vision Transformers (ViTs) (Dosovitskiy et al., 2020) have been very successful in computer vision. However, recognizing an image and segmenting it into coherent regions are treated as separate tasks or learned sequentially (Martin et al., 2001). Fig. 1 illustrates a common practice: a CNN (ViT) predicts the semantic class of an image from the image-level feature at the output of the final convolutional layer (transformer block), and additional clustering of earlier pixel-wise features is required to generate an image segmentation (Hwang et al., 2019; Ke et al., 2022). However, human vision has a general sense of segmentation hierarchy, in terms of groups of pixels or segments, before recognition even occurs. This perceptual organization perspective (Witkin & Tenenbaum, 1983; Biederman, 1987) has been overlooked in CNN and ViT architectures: models optimized for image classification tend to latch onto discriminative parts (Selvaraju et al., 2017) such as faces, often missing inconspicuous body parts that go with the face. Previous methods seldom explicitly model how different parts, such as the face and body, are organized into the whole animal. To understand the connections between parts and wholes, visual information must be extracted both locally and globally. There are three major approaches (Fig. 2):

1. Spatial downsampling: With pixels laid on a regular grid, features are extracted from patches. The granularity of visual information is determined by the patch size.
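To make the patch-granularity point concrete, here is a minimal sketch (not the paper's code) of the fixed-shape patch tokenization used by ViT-style models: an image on a regular grid is split into non-overlapping P x P patches, each flattened into one token, so the patch size P alone fixes the granularity of visual information.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    tokens = (image
              .reshape(H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 1, 3, 4)          # group pixels by patch
              .reshape(-1, patch * patch * C))   # one row per token
    return tokens

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens
# of dimension 16*16*3 = 768, regardless of the image content.
img = np.zeros((224, 224, 3))
print(patchify(img, 16).shape)  # (196, 768)
```

In contrast, the adaptive segment tokens proposed in this work follow region boundaries rather than a fixed grid.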



Inspired by Natural Language Processing (NLP), image patches are treated as visual word tokens of the entire image document. To extract more global information, ViT contextually updates feature representations based on pairwise correlations among all the tokens of an image.
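The contextual update described above can be sketched as single-head scaled dot-product self-attention; this is a minimal illustration with random stand-in projection matrices, not the model's actual implementation.

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """Update every token from its pairwise correlation with all tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise correlations
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over all tokens
    return weights @ V                               # contextual update

rng = np.random.default_rng(0)
x = rng.normal(size=(196, 64))                       # e.g. 196 patch tokens
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (196, 64)
```

Because every token attends to every other, the cost is quadratic in the number of tokens, which is one motivation for pooling tokens into fewer, larger segments between transformer blocks.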

