LEARNING FROM MULTISCALE WAVELET SUPERPIXELS USING GNN WITH SPATIALLY HETEROGENEOUS POOLING

Abstract

Neural networks have become the standard for image classification tasks. On one hand, convolutional neural networks (CNNs) achieve state-of-the-art performance by learning from a regular grid representation of images. On the other hand, graph neural networks (GNNs) have shown promise in learning image classification from an embedded superpixel graph. However, in the latter, studies have been restricted to SLIC superpixels, where 1) a single target number of superpixels is arbitrarily defined for an entire dataset irrespective of differences across images, and 2) the superpixels in a given image are of similar size despite intrinsic multiscale structure. In this study, we investigate learning from a new, principled representation in which individual images are represented by an image-specific number of multiscale superpixels. We propose WaveMesh, a wavelet-based superpixeling algorithm in which the number and sizes of superpixels in an image are systematically computed from the image content. We also present WavePool, a spatially heterogeneous pooling scheme tailored to WaveMesh superpixels. We study the feasibility of learning from the WaveMesh superpixel representation using SplineCNN, a state-of-the-art network for image graph classification. We show that under the same network architecture and training settings, SplineCNN with original Graclus-based pooling learns from WaveMesh superpixels on par with SLIC superpixels. Additionally, we observe that the best performance is achieved when replacing Graclus-based pooling with WavePool while using WaveMesh superpixels.

1. INTRODUCTION

Convolutional neural networks (CNNs) achieve state-of-the-art performance on a variety of image classification tasks from different domains (Tan & Le, 2019; Gulshan et al., 2016). CNNs learn from a regular pixel-grid representation of the images. Although not all pixels provide an equal amount of new information, by design the filters in the first layer of a CNN operate on each pixel, from top-left to bottom-right, in the same way. Additionally, images are typically resized to a prescribed size before being fed into a CNN. In applications that use standard CNN architectures or pre-trained models on a new image classification dataset, the images are typically uniformly downsampled to meet the input size requirements of the architecture being used. Uniform downsampling may be suboptimal, as real data naturally exhibits spatial and multiscale heterogeneity. Few studies have explored the impact of input image resolution on model performance (Sabottke & Spieler, 2020), despite its recognized importance (Lakhani, 2020).

A graph neural network (GNN) is a type of neural network that learns from graph-structured data. Recent studies have demonstrated the performance of GNNs on image graph classification tasks (Monti et al., 2017; Fey et al., 2018; Knyazev et al., 2019; Dwivedi et al., 2020). In this task, a GNN learns to classify images from embedded graphs that represent superpixels in the images. However, prior studies have been restricted to SLIC superpixels (Achanta et al., 2012). In this framework, a single target number of superpixels is arbitrarily defined for an entire dataset irrespective of differences across images, and the superpixels in a given image are of similar size despite intrinsic multiscale structure. Our proposed approach circumvents these limitations, as shown in Figure 1. The objectives of our work are twofold.
First, we aim to rethink the process of downsampling and/or superpixeling images by introducing a multiscale superpixel representation that sits between the regular grid and similar-sized superpixel representations. Second, we systematically study the feasibility of learning to classify images from embedded graphs that represent the multiscale superpixels. In this context, the contributions of our study are as follows.

• We present WaveMesh, an algorithm to superpixel (compress) images in the pixel domain. WaveMesh is based on the quadtree representation of the wavelet transform. Our sample-specific method leads to non-uniformly distributed, multiscale superpixels. The number and size of superpixels in an image are systematically computed by the algorithm based on the image content. WaveMesh requires at most one tunable parameter.

• We propose WavePool, a spatially heterogeneous pooling method tailored to WaveMesh superpixels. WavePool preserves spatial structure, leading to interpretable intermediate outputs. WavePool generalizes the classical pooling employed in CNNs and integrates easily with existing GNNs.

• To evaluate the WaveMesh representation and the WavePool method for image graph classification, we compare them with SLIC superpixels and Graclus-based pooling by conducting several experiments using SplineCNN, a network proposed by Fey et al. (2018).
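To illustrate the quadtree idea behind wavelet-based superpixeling, the sketch below recursively splits a square block into four quadrants whenever its wavelet detail energy exceeds a threshold; blocks that survive become square superpixels. This is a minimal illustration, not the authors' WaveMesh implementation: the function names are ours, and we use the fact that for an orthonormal multilevel Haar transform the only non-detail coefficient is the block mean, so by Parseval's identity the total detail energy equals the sum of squared deviations from the mean.

```python
import numpy as np

def detail_energy(block):
    """Total energy of the multilevel Haar detail coefficients of a block.

    For an orthonormal Haar transform the sole approximation coefficient
    encodes the block mean, so Parseval's identity gives this closed form.
    """
    return float(((block - block.mean()) ** 2).sum())

def wavemesh_like(img, threshold, y=0, x=0, size=None, out=None):
    """Quadtree split: subdivide blocks whose detail energy exceeds `threshold`.

    Returns a list of (row, col, size) square superpixels tiling `img`,
    which is assumed to be square with power-of-two side length.
    """
    if size is None:
        size = img.shape[0]
        out = []
    block = img[y:y + size, x:x + size]
    if size > 1 and detail_energy(block) > threshold:
        h = size // 2
        for dy, dx in ((0, 0), (0, h), (h, 0), (h, h)):
            wavemesh_like(img, threshold, y + dy, x + dx, h, out)
    else:
        out.append((y, x, size))
    return out
```

A flat image collapses to a single superpixel, while an image with localized structure is refined only where detail energy is high, yielding the sample-specific, multiscale behavior described above.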

2. RELATED WORK

Superpixeling. Grouping pixels to form superpixels was proposed by Ren & Malik (2003) as a preprocessing mechanism that preserves most of the structure necessary for image segmentation. Since then, many superpixeling algorithms have been proposed, including deep-learning-based methods (Liu et al., 2011; Li & Chen, 2015; Tu et al., 2018; Giraud et al., 2018; Yang et al., 2020; Zhang et al., 2020).

Graclus-based pooling. Pooling is used in GNNs to coarsen the graph by aggregating nodes within specified clusters. Graclus is a kernel-based multilevel graph clustering algorithm that efficiently clusters nodes in large graphs without any eigenvector computation. Graclus is used in many GNNs to obtain a clustering of the nodes, which is then used by the pooling operator to coarsen the graph.
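The coarsening step that follows clustering can be sketched as below: given a node-to-cluster assignment (e.g., produced by Graclus), node features are aggregated per cluster and edges are merged between clusters. This is an illustrative numpy sketch with max aggregation, a common choice in GNN pipelines; the clustering itself is taken as given, and the function name is ours.

```python
import numpy as np

def cluster_pool(x, edge_index, cluster):
    """Coarsen a graph given a node-to-cluster assignment.

    x          : (N, F) node feature matrix
    edge_index : (2, E) array of directed edges
    cluster    : (N,) cluster id per node (e.g., from a Graclus-style method)

    Node features are max-aggregated within each cluster; an edge is kept
    between two coarse nodes if any original edge connects their clusters.
    """
    ids, relabeled = np.unique(cluster, return_inverse=True)
    n_clusters = len(ids)
    # Max-pool node features within each cluster.
    pooled = np.full((n_clusters, x.shape[1]), -np.inf)
    for node, c in enumerate(relabeled):
        pooled[c] = np.maximum(pooled[c], x[node])
    # Remap edges to cluster ids, drop self-loops created by merging,
    # and deduplicate parallel edges.
    src, dst = relabeled[edge_index[0]], relabeled[edge_index[1]]
    keep = src != dst
    coarse_edges = np.unique(np.stack([src[keep], dst[keep]]), axis=1)
    return pooled, coarse_edges
```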



Figure 1: (a) Distribution of superpixel size, averaged across the MNIST training dataset, for different superpixel representations: none (left), WaveMesh (center), and SLIC (right). In each panel, an inset shows the graph representation of a single sample for illustration. The size of a node in the graph is proportional to the superpixel size. SLIC superpixels are not cubic, yet the x-axis binning is chosen to match the other plots. (b) Boxplots of the number of superpixels per image for the CIFAR-10 training dataset.

The SLIC algorithm proposed by Achanta et al. (2012), the superpixeling method used in prior GNN studies, is based on k-means clustering.

GNN for image graph classification. Prior studies have demonstrated the representational power and generalization ability of GNNs on image graph classification tasks using SLIC superpixels. Dwivedi et al. (2020) show that message-passing graph convolutional networks (GCNs) outperform Weisfeiler-Lehman GNNs on the MNIST and CIFAR-10 datasets. Recognizing the importance of the spatial and hierarchical structure inherent in images, Knyazev et al. (2019) model images as multigraphs representing SLIC superpixels computed at different user-defined scales, and successfully train GNNs on the multigraphs. SplineCNN, proposed by Fey et al. (2018), is another network for learning from irregularly structured data. It builds on the work of Monti et al. (2017) but uses a spline convolution kernel instead of Gaussian mixture model kernels.
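SLIC's k-means grouping can be sketched as follows: pixels are clustered on combined color and position features, with a compactness weight trading color similarity against spatial proximity. This toy sketch (names ours) differs from the full SLIC algorithm, which works in LAB color space, seeds centers on a regular grid, and restricts the search to a local window around each center.

```python
import numpy as np

def slic_like(img, n_segments, compactness=1.0, n_iter=10, seed=0):
    """Toy SLIC-style superpixels: k-means on (intensity, row, col) features.

    Unlike full SLIC, this searches globally rather than within a local
    window and operates on a grayscale image rather than LAB color.
    """
    h, w = img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    s = np.sqrt(h * w / n_segments)  # expected superpixel spacing
    # Spatial coordinates are scaled by compactness / spacing, as in SLIC's
    # distance measure, so color and position are commensurable.
    feats = np.stack([img.ravel(),
                      compactness / s * rows.ravel(),
                      compactness / s * cols.ravel()], axis=1)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(h * w, n_segments, replace=False)]
    for _ in range(n_iter):
        # Assign each pixel to its nearest center, then recompute centers.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_segments):
            pts = feats[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return labels.reshape(h, w)
```

Because the target number of segments is a fixed input, every image in a dataset receives roughly the same number of similarly sized superpixels, which is precisely the limitation WaveMesh is designed to remove.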

