CAST: CONCURRENT RECOGNITION AND SEGMENTATION WITH ADAPTIVE SEGMENT TOKENS

Abstract

Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition with hierarchical image segmentation based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at the superpixel, segment, and full-image levels, enforcing consistency across the induced segmentations while maximizing discrimination among image instances. Our model innovates vision transformers in three aspects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination. Our work delivers the first model for concurrent recognition and hierarchical segmentation without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.

1. INTRODUCTION

Convolutional neural networks (CNN) (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016) and Vision Transformers (ViT) (Dosovitskiy et al., 2020) have been very successful in computer vision. However, recognizing an image and segmenting it into coherent regions are treated as separate tasks or learned sequentially (Martin et al., 2001). Fig. 1 illustrates a common practice: a CNN (ViT) predicts the semantic class of an image based on the image-level feature from the output of the final convolutional layer (transformer block), and additional clustering based on earlier pixel-wise features is required to generate an image segmentation (Hwang et al., 2019; Ke et al., 2022).

However, human vision has a general sense of segmentation hierarchy, in terms of groups of pixels or segments, before recognition even occurs. This perceptual organization perspective (Witkin & Tenenbaum, 1983; Biederman, 1987) has been overlooked in CNN and ViT architectures: models optimized for image classification tend to latch onto discriminative parts (Selvaraju et al., 2017) such as faces, often missing inconspicuous body parts that go with the face. Previous methods seldom model explicitly how different parts such as face and body are organized into the whole animal.

To understand the connections between parts and wholes, visual information must be extracted locally and globally. There are three major approaches. 1) Grid-based pooling: CNN aggregates local features over a regular grid into progressively more global ones. 2) Full attention: ViT relates all patch tokens to one another using attention modules (Vaswani et al., 2017). However, ViT is computationally inefficient as all image tokens are kept in every transformer block. 3)
Significance-based subsampling: to increase ViT's computational efficiency, tokens are subsampled at higher levels based on their significance scores. PoWER-BERT (Goyal et al., 2020) and Token Pooling (Marin et al., 2021) define the significance score of a token as the total attention it receives from all other tokens. Downsampling then retains only the most dominant visual features in the image; such methods keep only the most informative tokens in the final output representations.

These existing methods have two major issues. 1) Both CNN and ViT models take regularly shaped patch features as inputs, regardless of what is in the image. Image segmentation derived from such representations often fails to align with object contours. 2) Image segmentation does not involve local-to-global feature extraction; it is treated as a visual task separate from image-wise recognition.

Our first insight is that pixel groupings are not a computational inconvenience (as opposed to regular patches), but a natural structure to be exploited for better visual computing. Unlike existing CNN and ViT models, which extract features on a regular grid throughout the entire model, we get to low-level pixel groupings directly at an early stage and develop feature representations on top of them. Our model takes segment features as input tokens and carries this adaptive segment representation through deeper layers. Post-processing with pixel-wise clustering methods is no longer needed.

Our second insight is to derive fine-to-coarse pixel groupings jointly with local-to-global feature extraction. Given a set of token features, we cluster them into fewer components. The next-level feature is the result of pooling the current features within each cluster. Since our input tokens come from segments of an image, feature clustering turns fine-grained segments into coarse-grained regions.
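The clustering-then-pooling step described above can be sketched as follows. This is a minimal illustration only: it uses plain k-means as a stand-in for the paper's learned graph pooling, and the function name `pool_tokens` and all shapes are made up for the example.

```python
import numpy as np

def pool_tokens(features, num_clusters, seed=0):
    """One pooling step (illustrative sketch): cluster token features
    into fewer components, then average features within each cluster
    to form the next-level, coarser tokens."""
    rng = np.random.default_rng(seed)
    # Plain k-means as a stand-in for the learned clustering.
    centers = features[rng.choice(len(features), num_clusters, replace=False)]
    for _ in range(10):
        # Assign each token to its nearest center.
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(num_clusters):
            if (assign == k).any():
                centers[k] = features[assign == k].mean(axis=0)
    # Next-level features: pool (average) current features within each cluster.
    pooled = np.stack([features[assign == k].mean(axis=0)
                       if (assign == k).any() else centers[k]
                       for k in range(num_clusters)])
    return assign, pooled

tokens = np.random.default_rng(1).normal(size=(64, 16))   # 64 fine segment tokens
assign, coarse = pool_tokens(tokens, num_clusters=8)
print(coarse.shape)  # (8, 16): fewer, coarser tokens
```

Because each fine token is mapped to exactly one cluster, the assignment itself is a coarsening of the segmentation, while the pooled features become the inputs to the next level.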
By repeating the procedure, we obtain a consistent fine-to-coarse (hierarchical) image segmentation and corresponding feature representations at each level of granularity.

We propose to integrate such data-driven perceptual organization into Vision Transformers (Dosovitskiy et al., 2020). We develop Concurrent recognition and segmentation with Adaptive Segment Tokens (CAST). It has three novel aspects (Fig. 2). 1) We use adaptive segment tokens instead of fixed-shape patch tokens. They no longer live on a regular grid, and their shapes and numbers vary with the image. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination.
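Why repeated pooling yields a *consistent* hierarchy can be seen with a toy example. Each pixel carries a fine segment id, and each pooling level only relabels segments; coarse segmentations are therefore obtained by composing assignment tables, so pixels grouped at a fine level are never split at a coarser one. The label maps below are made up for illustration.

```python
import numpy as np

# Toy pixel -> fine-segment id map (e.g., from superpixels).
superpixels = np.array([[0, 0, 1, 1],
                        [2, 2, 3, 3],
                        [2, 2, 3, 3]])
level1 = np.array([0, 0, 1, 1])   # fine segment -> region (4 segments -> 2 regions)
level2 = np.array([0, 0])         # region -> whole image (2 regions -> 1)

seg_fine = superpixels                     # finest segmentation
seg_mid = level1[superpixels]              # relabel pixels by level-1 pooling
seg_coarse = level2[level1][superpixels]   # compose assignments, then relabel

# Coarse labels are functions of fine labels, so the hierarchy is consistent.
print(seg_mid)     # two regions
print(seg_coarse)  # one region covering the image
```

The same composition runs in reverse at inference: picking any level of the assignment chain reads out a segmentation at that granularity without extra clustering.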



Figure 1: We innovate vision transformer models to concurrently learn image recognition and hierarchical image segmentation from unlabeled images alone. Top: ViT (Dosovitskiy et al., 2020) takes patch tokens as inputs and maintains the same large number of tokens through all encoder blocks. Image segmentation would require additional pixel-wise clustering (e.g., K-Means) on the fixed patch-wise features. Bottom: Our model takes segment tokens as inputs and hierarchically groups them into fewer coarsened region tokens. Unlike patch tokens, these segment tokens adapt to the image and vary in shape. We unify fine-to-coarse feature learning at multiple levels in a single model to support not only recognition with maximum image-wise discrimination, but also segmentation with consistency across the hierarchy. Consequently, we achieve better recognition and segmentation with higher computational efficiency.

