Image as Set of Points

Abstract

What is an image, and how should we extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolution operations in local regions; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via a global-range attention mechanism. In this work, we introduce a straightforward and promising paradigm for visual representation, called Context Clusters. Context Clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, relying only on a clustering algorithm for spatial interaction. Owing to the simple design, we show that CoCs offer gratifying interpretability via visualization of the clustering process. Our CoCs aim at providing a new perspective on images and visual representation, which may enjoy broad applications across domains and exhibit profound insights. Even though we do not target SOTA performance, CoCs still achieve results comparable to or even better than ConvNets and ViTs on several benchmarks. Code is available at: https://github.com/ma-xu/Context-Cluster.

1. Introduction

The way we extract features depends a lot on how we interpret an image. As a fundamental paradigm, Convolutional Neural Networks (ConvNets) have dominated the field of computer vision and considerably improved the performance of various vision tasks in recent years (He et al., 2016; Xie et al., 2021; Ge et al., 2021). Methodologically, ConvNets conceptualize an image as a collection of arranged pixels in rectangular form and extract local features using convolution in a sliding-window fashion. Benefiting from important inductive biases like locality and translation equivariance, ConvNets are efficient and effective. Recently, Vision Transformers (ViTs) have significantly challenged ConvNets' hegemony in the vision domain. Derived from language processing, Transformers (Vaswani et al., 2017) treat an image as a sequence of patches, and a global-range self-attention operation is employed to adaptively fuse information across patches. With the resulting models (i.e., ViTs), the inductive biases inherent in ConvNets are abandoned, and gratifying results are obtained (Touvron et al., 2021). Recent work has shown tremendous improvements in the vision community, mostly built on top of convolution or attention (e.g., ConvNeXt (Liu et al., 2022), MAE (He et al., 2022), and CLIP (Radford et al., 2021)). Meanwhile, some attempts combine convolution and attention, like CMT (Guo et al., 2022a) and CoAtNet (Dai et al., 2021). These methods scan images in a grid (convolution) yet explore mutual relationships within a sequence (attention), enjoying the locality prior (convolution) without sacrificing a global receptive field (attention). While they inherit the advantages of both and achieve better empirical performance, the insights and knowledge remain restricted to ConvNets and ViTs.
Instead of being lured into the trap of chasing incremental improvements, we underline that feature extractors beyond convolution and attention are also worth investigating. While convolution and attention are acknowledged to have significant benefits and an enormous influence on the field of vision, they are not the only choices. MLP-based architectures (Touvron et al., 2022; Tolstikhin et al., 2021) have demonstrated that a pure MLP-based design can achieve similar performance. Besides, using a graph network as the feature extractor has also proven feasible (Han et al., 2022). Hence, we pursue a new paradigm of feature extraction that can provide novel insights rather than incremental performance improvements. In this work, we look back to a classical algorithm for fundamental visual representation: the clustering method (Bishop & Nasrabadi, 2006). Holistically, we view an image as a set of data points and group all points into clusters. In each cluster, we aggregate the points into a center and then adaptively dispatch the aggregated feature back to all the points. We call this design context cluster. Fig. 1 illustrates the process. Specifically, we consider each pixel as a 5-dimensional data point carrying color and position information. In a sense, we convert an image into a point cloud and utilize methodologies from point cloud analysis (Qi et al., 2017b; Ma et al., 2022) for image representation learning. This bridges the representations of images and point clouds, showing strong generalization and opening possibilities for an easy fusion of multiple modalities. Given a set of points, we introduce a simplified clustering method to group the points into clusters. The clustering process shares a similar idea with SuperPixel (Ren & Malik, 2003), where similar pixels are grouped, but the two are fundamentally different.
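To make the point-set view concrete, here is a minimal sketch (our own illustration, not the paper's implementation; all names are ours) of flattening an H x W RGB image into a set of 5-dimensional points, each carrying its raw color plus normalized coordinates:

```python
import numpy as np

def image_to_points(img):
    """Convert an H x W x 3 image into an (H*W) x 5 point set.

    Each point holds its raw color (r, g, b) plus normalized (x, y)
    coordinates, so the pixel grid becomes an unordered set of points.
    """
    h, w, _ = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([xs / w, ys / h], axis=-1)     # positions in [0, 1)
    points = np.concatenate([img, coords], axis=-1)  # (H, W, 5)
    return points.reshape(-1, 5)                     # set of H*W points

img = np.random.rand(4, 6, 3)
pts = image_to_points(img)
print(pts.shape)  # (24, 5)
```

Because the output is a plain set of feature vectors, the same interface extends naturally to other point-based data (e.g., appending depth for RGB-D).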
To the best of our knowledge, we are the first to introduce a clustering method for general visual representation and make it work. In contrast, SuperPixel and its later versions mainly target image pre-processing (Jampani et al., 2018) or particular tasks like semantic segmentation (Yang et al., 2020; Yu et al., 2022b). We instantiate our deep network based on the context cluster and name the resulting models Context Clusters (CoCs). Our new design is inherently different from ConvNets and ViTs, but we also inherit some positive philosophy from them, including the hierarchical representation (Liu et al., 2022) from ConvNets and the Metaformer (Yu et al., 2022c) framework from ViTs. CoCs reveal distinct advantages. First, by considering an image as a set of points, CoCs show great generalization to different data domains, like point clouds, RGB-D images, etc. Second, the context clustering process provides CoCs with gratifying interpretability: by visualizing the clustering in each layer, we can explicitly understand what each layer learns. Even though our method does not target SOTA performance, it still achieves performance on par with or even better than ConvNets and ViTs on several benchmarks. We hope our context cluster will bring new breakthroughs to the vision community.

2. Related Work

Clustering in Image Processing. While clustering approaches in image processing (Castleman, 1996) have fallen out of favor in the deep learning era, they have never disappeared from computer vision. A time-honored work is SuperPixel (Ren & Malik, 2003), which segments an image into regions by grouping sets of pixels that share common characteristics. Given the desired sparsity and simple representation, SuperPixel has become a common practice for image preprocessing. A naive application of SuperPixel exhaustively clusters pixels over the entire image (e.g., via the K-means algorithm), making the computational cost heavy. To this end, SLIC (Achanta et al., 2012) limits the clustering operation to a local region and evenly initializes the K-means centers for better and faster convergence. In recent years, clustering methods have experienced a surge of interest and are closely bound with deep networks (Li & Chen, 2015; Jampani et al., 2018; Qin et al., 2018; Yang et al., 2020). To create superpixels for deep networks, SSN (Jampani et al., 2018) proposes a differentiable SLIC method, which is end-to-end trainable and enjoys favorable runtime. Most recently, tentative efforts have been made towards applying clustering methods within networks for specific vision tasks, like segmentation (Yu et al., 2022b; Xu et al., 2022) and fine-grained recognition (Huang & Li, 2020). For example, CMT-DeepLab (Yu et al., 2022a) interprets the object queries in the segmentation task as cluster centers.
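The SLIC idea above can be illustrated with a toy sketch (our own simplification: centers are initialized on an even grid and refined by a few K-means iterations, but for brevity every point is compared against all centers, whereas real SLIC restricts each point's search to nearby centers and separately weights color versus spatial distance):

```python
import numpy as np

def slic_like(points, h, w, gh, gw, iters=5):
    """points: (h*w, 5) array of [r, g, b, x, y] rows in row-major order.
    Initialize gh*gw centers evenly over the image grid, then run a few
    plain K-means updates. A toy stand-in for SLIC, not the real thing."""
    # evenly spaced initial centers on the pixel grid
    ys = ((np.arange(gh) + 0.5) * h / gh).astype(int)
    xs = ((np.arange(gw) + 0.5) * w / gw).astype(int)
    idx = (ys[:, None] * w + xs[None, :]).ravel()
    centers = points[idx].copy()

    for _ in range(iters):
        # squared distance from every point to every center
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(len(centers)):
            m = labels == k
            if m.any():
                centers[k] = points[m].mean(0)  # recenter on members
    return labels, centers

h, w = 8, 8
img = np.random.rand(h, w, 3)
yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
points = np.concatenate([img.reshape(-1, 3),
                         np.stack([xx.ravel(), yy.ravel()], 1)], 1)
labels, centers = slic_like(points, h, w, 2, 2)
print(labels.shape, centers.shape)  # (64,) (4, 5)
```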



Figure 1: A context cluster in our network trained for image classification. We view an image as a set of points and sample c centers for point clustering. Point features are aggregated and then dispatched within a cluster. For cluster center C_i, we first aggregate all points {x_i^0, x_i^1, ..., x_i^n} in the i-th cluster; the aggregated result is then dynamically distributed to all points in the cluster. See § 3 for details.
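The aggregate-then-dispatch step in the caption can be sketched as follows. This is a simplified NumPy illustration under our own assumptions (cosine similarity, hard assignment, additive dispatch); the model detailed in § 3 is more elaborate:

```python
import numpy as np

def context_cluster(points, centers):
    """points: (n, d), centers: (c, d). Hard-assign each point to its
    most similar center, aggregate each cluster into one feature, then
    dispatch that feature back to every member point."""
    # cosine similarity between every point and every center
    p = points / np.linalg.norm(points, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = p @ c.T                    # (n, c) similarity matrix
    assign = sim.argmax(axis=1)      # cluster index per point

    out = points.copy()
    for k in range(centers.shape[0]):
        members = assign == k
        if members.any():
            agg = points[members].mean(axis=0)    # aggregate the cluster
            out[members] = points[members] + agg  # dispatch back to points
    return out, assign

pts = np.random.randn(10, 5)
ctr = pts[np.random.choice(10, 3, replace=False)]  # sample 3 centers
new_pts, assign = context_cluster(pts, ctr)
print(new_pts.shape, assign.shape)  # (10, 5) (10,)
```

Stacking such a stage hierarchically (with point reduction between stages) yields the overall architecture; here the aggregation and dispatch are parameter-free for clarity.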

