LEARNING FROM MULTISCALE WAVELET SUPERPIXELS USING GNN WITH SPATIALLY HETEROGENEOUS POOLING

Abstract

Neural networks have become the standard for image classification tasks. On one hand, convolutional neural networks (CNNs) achieve state-of-the-art performance by learning from a regular grid representation of images. On the other hand, graph neural networks (GNNs) have shown promise in learning image classification from an embedded superpixel graph. However, in the latter, studies have been restricted to SLIC superpixels, where 1) a single target number of superpixels is arbitrarily defined for an entire dataset irrespective of differences across images and 2) the superpixels in a given image are of similar size despite intrinsic multiscale structure. In this study, we investigate learning from a new principled representation in which individual images are represented by an image-specific number of multiscale superpixels. We propose WaveMesh, a wavelet-based superpixeling algorithm, where the number and sizes of superpixels in an image are systematically computed based on the image content. We also present WavePool, a spatially heterogeneous pooling scheme tailored to WaveMesh superpixels. We study the feasibility of learning from the WaveMesh superpixel representation using SplineCNN, a state-of-the-art network for image graph classification. We show that under the same network architecture and training settings, SplineCNN with original Graclus-based pooling learns from WaveMesh superpixels on-par with SLIC superpixels. Additionally, we observe that the best performance is achieved when replacing Graclus-based pooling with WavePool while using WaveMesh superpixels.

1. INTRODUCTION

Convolutional neural networks (CNNs) achieve state-of-the-art performance on a variety of image classification tasks from different domains (Tan & Le, 2019; Gulshan et al., 2016). CNNs learn from a regular pixel-grid representation of the images. Although not all pixels provide an equal amount of new information, by design the filters in the first layer of a CNN operate on each pixel from top-left to bottom-right in the same way. Additionally, images are typically resized to a prescribed size before being fed into a CNN. In applications that use standard CNN architectures or pre-trained models on a new image classification dataset, the images are typically uniformly downsampled to meet the input size requirements of the architecture being used. Uniform downsampling may be suboptimal, as real data naturally exhibits spatial and multiscale heterogeneity. Few studies have explored the impact of input image resolution on model performance (Sabottke & Spieler, 2020), despite its recognized importance (Lakhani, 2020).

A graph neural network (GNN) is a type of neural network that learns from graph-structured data. Recent studies have demonstrated the performance of GNNs on image graph classification tasks (Monti et al., 2017; Fey et al., 2018; Knyazev et al., 2019; Dwivedi et al., 2020). In this task, a GNN learns to classify images from embedded graphs that represent superpixels in the images. However, prior studies have been restricted to SLIC superpixels (Achanta et al., 2012). In this framework, a single target number of superpixels is arbitrarily defined for an entire dataset irrespective of differences across images, and the superpixels in a given image are of similar size despite intrinsic multiscale structure. Our proposed approach circumvents these limitations, as shown in Figure 1.

The objectives of our work are twofold. First, we aim to rethink the process of downsampling and/or superpixeling images by introducing a multiscale superpixel representation that can be considered as lying in between the regular grid and similar-sized superpixel representations. Second, we systematically study the feasibility of learning to classify images from embedded graphs that represent the multiscale superpixels. In this context, the contributions of our study are as follows.

• We present WaveMesh, an algorithm to superpixel (compress) images in the pixel domain. WaveMesh is based on the quadtree representation of the wavelet transform. Our sample-specific method leads to non-uniformly distributed and multiscale superpixels. The number and size of superpixels in an image are systematically computed by the algorithm based on the image content. WaveMesh requires at most one tunable parameter.
• We propose WavePool, a spatially heterogeneous pooling method tailored to WaveMesh superpixels. WavePool preserves spatial structure, leading to interpretable intermediate outputs. WavePool generalizes the classical pooling employed in CNNs, and easily integrates with existing GNNs.
• To evaluate the WaveMesh representation and the WavePool method for image graph classification, we compare them with SLIC superpixels and graclus-based pooling by conducting several experiments using SplineCNN, a network proposed by Fey et al. (2018).

2. RELATED WORK

Superpixeling. Grouping pixels to form superpixels was proposed by Ren & Malik (2003) as a preprocessing mechanism that preserves most of the structure necessary for image segmentation. Since then many superpixeling algorithms have been proposed, including deep learning based methods (Liu et al., 2011; Li & Chen, 2015; Tu et al., 2018; Giraud et al., 2018; Yang et al., 2020; Zhang et al., 2020). The SLIC algorithm proposed by Achanta et al. (2012) is based on k-means clustering.

GNN for image graph classification. Prior studies have demonstrated the representational power and generalization ability of GNNs on image graph classification tasks using SLIC superpixels. Dwivedi et al. (2020) show that message passing graph convolution networks (GCNs) outperform Weisfeiler-Lehman GNNs on MNIST and CIFAR-10 datasets. Recognizing the importance of spatial and hierarchical structure inherent in images, Knyazev et al. (2019) model images as multigraphs that represent SLIC superpixels computed at different user-defined scales, and then successfully train GNNs on the multigraphs. SplineCNN proposed by Fey et al. (2018) is another network for learning from irregularly structured data. It builds on the work of Monti et al. (2017), but uses a spline convolution kernel instead of Gaussian mixture model kernels.

Graclus-based pooling. Pooling is used in GNNs to coarsen the graph by aggregating nodes within specified clusters. Graclus is a kernel-based multilevel graph clustering algorithm that efficiently clusters nodes in large graphs without any eigenvector computation. Graclus is used in many GNNs to obtain a clustering on the nodes, which is then used by the pooling operator to coarsen the graph (Defferrard et al., 2016; Monti et al., 2017; Fey et al., 2018). Hereafter, we refer to pooling based on graclus clustering as graclus-based pooling.

[Figure 2 panel labels: Input images; Wavelet-based quadtree compression; Filtered images; Superpixel meshes; Graph generation]

Figure 2: Filtering images in wavelet space generates non-uniform superpixel meshes that are then represented as embedded graphs. The leftmost images are preprocessed with the method described in section 3 with a threshold value equal to five times the theoretical value. Natural images are from the Pascal dataset (Everingham et al., 2010), and the medical image is from the NLST dataset.

The WaveMesh algorithm is broken down into three elementary steps, detailed below: 1) images are wavelet transformed, 2) images are filtered in wavelet space by thresholding the wavelet coefficients, and 3) the superpixel mesh is generated from the wavelet-filter mask. The algorithm is rooted in the seminal work on wavelet theory (Mallat, 1989; Donoho & Johnstone, 1994b). The particular way in which wavelets are used in this work is inspired by their related application in the physical sciences (Schneider & Vasilyev, 2010; Bassenne et al., 2017; 2018).

3.1. STEP 1: WAVELET TRANSFORMATION OF THE INPUT IMAGE

Consider a two-dimensional (2D) image $I$ discretely described by its pixel values $I[x_0]$ centered at locations $x_0 = 2^{-1}(i\Delta, j\Delta)$ of an $N \times N$ regular grid, where $\Delta$ is the inter-pixel spacing and $i, j = 1, 3, \ldots, 2N-1$. A continuous wavelet representation of $I$ is

$$I(x) = \sum_{x_0} I^{(0)}[x_0]\, \phi^{(0)}(x - x_0),$$

where $x$ is the continuous pixel-space coordinate, and $\phi^{(0)}(x - x_0)$ are scaling functions that form an orthonormal basis of low-pass filters centered at $x_0$, with filter width $\Delta$. The scaling functions have unit energy, $\langle \phi^{(0)}(x - x_0)\, \phi^{(0)}(x - x_0) \rangle = 1$, where the bracket operator $\langle y \rangle = \frac{1}{(N\Delta)^2} \int y(x)\, dx$ denotes the global average of a general 2D continuous field $y(x)$. In practice, when dealing with discrete signals, $I^{(0)}[x_0]$ cannot be computed exactly, since $I$ is only known at the discrete points $x_0$. Instead, it is numerically discretized and the approximation coefficients $I^{(0)}[x_0]$ are estimated as an algebraic function of $I[x_0]$. Assuming that $\phi^{(0)}(x - x_0)$ decays fast away from $x = x_0$, we get $I^{(0)}[x_0] = I[x_0]/N$ (Addison, 2017). This estimate for $I^{(0)}[x_0]$ is the initialization stage of the recursive wavelet multiresolution algorithm (MRA) of Mallat (1989), which enables the computation of wavelet coefficients at coarser scales.

The decomposition of the finest-scale low-pass filter $\phi^{(0)}(x - x_0)$ in terms of narrow-band wavelet filters $\psi^{(s,d)}(x - x_s)$ with increasingly large filter width and a coarsest-scale scaling function $\phi^{(S)}(x - x_S)$ yields the full wavelet-series expansion of $I$,

$$I(x) = \sum_{s=1}^{S} \sum_{x_s} \sum_{d=1}^{3} I^{(s,d)}[x_s]\, \psi^{(s,d)}(x - x_s) + I^{(S)}[x_S]\, \phi^{(S)}(x - x_S). \qquad (1)$$

Here, $I^{(s,d)}[x_s] = \langle I(x)\, \psi^{(s,d)}(x - x_s) \rangle$ and $I^{(S)}[x_S] = \langle I(x)\, \phi^{(S)}(x - x_S) \rangle$ are the wavelet and approximation coefficients at scales $s$ and $S$, respectively, obtained from the orthonormality properties of the wavelet and scaling functions. In this formulation, $d = (1, 2, 3)$ is a wavelet directionality index, and $s = (1, 2, \ldots, S)$ is a scale exponent, with $S = \log_2 N$ the number of resolution levels allowed by the grid (5 for 32×32 images). Similarly, $x_s = 2^{s-1}(i\Delta, j\Delta)$ is a scale-dependent wavelet grid of $(N/2^s) \times (N/2^s)$ elements where the basis functions are centered, with $i, j = 1, 3, \ldots, N/2^{s-1} - 1$. The wavelet coefficients represent the local fluctuations of $I$ centered at $x_s$ at scale $s$, while the approximation coefficient is proportional to the global mean of $I$. At each scale, the filter width of the wavelets is $2^s \Delta$.

In this study, the 2D orthonormal basis functions $\psi^{(s,d)}(x - x_s)$ are products of one-dimensional (1D) Haar wavelets (Meneveau, 1991). Haar wavelets have a narrow spatial support that provides a high degree of spatial localization. However, they display large spectral leakage at high wavenumbers, since infinite spectral and spatial resolutions cannot be simultaneously attained due to limitations imposed by the uncertainty principle (Addison, 2017). Different boundary conditions can be assumed for the field $I$. We do not require such a choice in this study as we restrict ourselves to square images. However, the wavelet MRA framework is not limited to square inputs and can be generalized to rectangular inputs (Addison, 2017; Kim et al., 2018).

The definition of 2D wavelets as multiplicative products of 1D wavelets is a particular choice that follows the MRA formulation described by Mallat (1989), in which the multivariate wavelets are characterized by an isotropic scale and therefore render limited information about anisotropy in the image. A large number of alternative basis functions have recently been proposed for replacing traditional wavelets when analyzing multi-dimensional data that exhibit complex anisotropic structures such as filaments and sheets. These include, but are not limited to, curvelets, contourlets, and shearlets (see Kutyniok & Labate (2012) for an extensive review on this topic).
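The recursion of Step 1 can be illustrated with a minimal NumPy sketch, assuming a square single-channel image whose side is a power of two. The function name `haar_mra` and the ordering of the three directions `d` are illustrative choices and are not taken from the original implementation.

```python
import numpy as np


def haar_mra(image):
    """Orthonormal 2D Haar multiresolution analysis of a 2^S x 2^S image.

    Returns (details, approx): details[s] has shape (3, N / 2^s, N / 2^s)
    and holds the three directional coefficients at scale s; approx is the
    single coarsest-scale approximation coefficient.
    """
    n = image.shape[0]
    if n < 1 or image.shape != (n, n) or n & (n - 1):
        raise ValueError("expected a square image whose side is a power of two")
    approx = image.astype(float) / n        # initialization: I^(0)[x_0] = I[x_0] / N
    details = {}
    s = 0
    while approx.shape[0] > 1:
        s += 1
        a, b = approx[0::2, 0::2], approx[0::2, 1::2]   # top-left, top-right
        c, d = approx[1::2, 0::2], approx[1::2, 1::2]   # bottom-left, bottom-right
        details[s] = np.stack([
            (a - b + c - d) / 2.0,   # left column minus right column
            (a + b - c - d) / 2.0,   # top row minus bottom row
            (a - b - c + d) / 2.0,   # diagonal difference
        ])
        approx = (a + b + c + d) / 2.0      # low-pass input for the next scale
    return details, approx[0, 0]
```

With this normalization the squared coefficients sum to the global average of the squared image, and the final approximation coefficient equals the image mean, which is one way to check an implementation against the orthonormality stated above.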

3.2. STEP 2: FILTERING OF THE IMAGE IN WAVELET SPACE

The second step decomposes $I$ as

$$I = I_{>} + I_{\leq}, \qquad (2)$$

where the filtered component $I_{>}$ and the remainder $I_{\leq}$ correspond to the most and least energetic wavelet modes of $I$, respectively. By construction, these two components are not spatially cross-correlated, as implied by the orthogonality of the wavelets and by the filtering operation described below. Note that large wavelet coefficients are associated with large fluctuations within the corresponding region of the scale-dependent wavelet grid $x_s$, these being markers of underlying coherent structures.

Under the assumption that $I_{\leq}$ is additive Gaussian white noise, Donoho & Johnstone (1994a) described a wavelet-based algorithm that is optimal for achieving the target decomposition (2), since it minimizes the maximum $L^2$-estimation error of $I_{>}$. $I_{>}$ is obtained by retaining only the wavelet coefficients $I^{(s,d)}$ whose absolute values are at least $T$,

$$I^{(s,d)}_{>}[x_s] = \begin{cases} I^{(s,d)}[x_s] & \text{if } |I^{(s,d)}[x_s]| \geq T, \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

for all scales $s$, positions $x_s$ and directions $d$. In Equation 3, $T$ is a theoretical threshold defined as

$$T = \sqrt{2\, \sigma^2_{I_{\leq}} \ln N^2}, \qquad (4)$$

where $\sigma^2_{I_{\leq}}$ is the unknown variance of $I_{\leq}$. In this study, the iterative method of Azzalini et al. (2005) is employed, which converges to $T$ starting from a first iteration where $\sigma^2_{I_{\leq}}$ in Equation 4 is substituted by the variance $\sigma^2_I$ of the total image $I$. This iterative procedure does not introduce significant computational overhead, since only one wavelet transform is required independently of the number of iterations. The algorithm does not introduce any hyperparameter when the theoretical threshold value is used. Note that the threshold is image-dependent, thereby ensuring that the algorithm adapts the number of superpixels to each image. The above filtering operation is equivalent to applying a binary filter mask to the wavelet coefficients, referred to below as the wavelet-filter mask.
The iterative method is deemed converged when the relative variation in the estimated threshold T is less than 0.1% across consecutive iterations. A maximum of O(10) iterations was required to obtain the results presented below. The overall computational cost is O(n_i M), where n_i is the number of iterations and M is the number of pixels in the image (Azzalini et al., 2005). In this work, we allow for a further reduction in the number of superpixels by letting the threshold T take larger values.

Figure 2 illustrates the application of this wavelet filtering method on four images; for RGB images, filtering is applied to each channel independently. Most of the structural and edge information is preserved at all scales. However, a drawback of the method is that the superpixel boundaries are necessarily regular and axis-aligned.

To generate superpixels for a given image, the final step is a grid adaptation based on the wavelet-filter mask described in subsection 3.2. The result is a non-uniform grid of multiscale superpixels adapted around regions of the image with high variability.
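The iterative threshold estimate described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `iterative_threshold` is hypothetical, and the sketch assumes wavelet coefficients in the standard orthonormal normalization (sum of squared coefficients equal to the sum of squared pixel values), so that the per-pixel variance of the remainder is the sum of the squared sub-threshold coefficients divided by the number of pixels.

```python
import math


def iterative_threshold(coeffs, n_pixels, rtol=1e-3, max_iter=100):
    """Estimate T = sqrt(2 * sigma^2 * ln(n_pixels)) by fixed-point iteration.

    coeffs   : flat iterable of detail (wavelet) coefficients, computed once
    n_pixels : number of pixels in the image (N^2 for an N x N image)
    rtol     : relative change in T below which the iteration stops (0.1%)
    Returns (threshold, number_of_iterations).
    """
    squares = [c * c for c in coeffs]
    log_term = 2.0 * math.log(n_pixels)
    # First guess: the remainder variance is replaced by the total variance.
    sigma2 = sum(squares) / n_pixels
    threshold = math.sqrt(sigma2 * log_term)
    for iteration in range(1, max_iter + 1):
        # Variance carried by the coefficients currently below the threshold.
        sigma2 = sum(q for q in squares if q < threshold * threshold) / n_pixels
        updated = math.sqrt(sigma2 * log_term)
        if abs(updated - threshold) <= rtol * threshold:
            return updated, iteration
        threshold = updated
    return threshold, max_iter
```

Only the list of squared coefficients is reused across iterations, which reflects the remark above that a single wavelet transform suffices regardless of the number of iterations.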

3.3. STEP 3: GENERATING THE SUPERPIXEL MESH FROM THE WAVELET-FILTER MASK

Quadtree representation. The algorithm is perhaps best understood by representing the wavelet coefficients in a quadtree (Finkel & Bentley, 1974), a tree data structure in which each internal node has exactly four children. A quadtree-based representation of wavelet coefficients was previously shown to be an efficient data structure for wavelet-based image compression (Banham & Sullivan, 1992; Wakin et al., 2003). Here, the height of this quadtree equals the number of decomposition levels S in the wavelet transform. Each vertex at a given level s is associated with a triplet of wavelet coefficients $[I^{(s,1)}[x_s], I^{(s,2)}[x_s], I^{(s,3)}[x_s]]$. All vertices from a given level correspond to wavelet coefficients across all locations at a given scale. The children of a vertex are the wavelet coefficients from the same region in space at the next smaller scale. The quadtree representation of the wavelet coefficients of an 8×8 image is schematically represented in Figure 3(a). The number on each vertex indicates the scale, from the smallest scale s=1 associated with 2×2 pixel patches up to the largest scale s=3 associated with the entire 8×8 image. The pixel regions associated with each wavelet coefficient are delineated by solid lines in the three leftmost panels of Figure 3(c).

Node tagging. The vertices in the tree are tagged according to the filtering algorithm described in subsection 3.2. The tagged elements of the tree, filled in blue in Figure 3(a, b), correspond to those with absolute values larger than the threshold T, and therefore correspond to locations in the image with important spatial variability. In the 2D case, a vertex is tagged if at least one of its three wavelet coefficients is larger in absolute value than the threshold.
Additional tagging, shown in green, is applied to wavelet coefficients that are smaller than the threshold T but that correspond to a spatial region with at least one tagged wavelet coefficient at a smaller scale. This corresponds to tagging all the ancestors of previously tagged vertices. This tagging procedure enforces square superpixels by ensuring that when there is a coherent structure at scale s but not at the larger scale s+1, the wavelet coefficients at scale s+1 at that location are also tagged, hence triggering local grid refinement at level s+1. Non-tagged vertices are pruned as shown in Figure 3(b).

Mesh generation. Starting from the coarsest possible wavelet grid x_s = x_S that contains just one superpixel, the algorithm adapts the grid by recursively splitting it as follows. If the wavelet coefficient corresponding to a region is tagged, then that region is split into 2×2 superpixels, which locally refines the grid. Otherwise, the recursion stops for that region. The same recursive loop is then applied to the refined superpixels. The final configuration of the adapted grid is obtained when none of the wavelet coefficients in any of the superpixels are tagged. An example of a final adapted grid is shown in Figure 3(c). The dashed lines correspond to the superpixel refinement due to the vertex being tagged. Adapted grids from real images are shown in Figure 2, where the superpixel meshes exhibit the desired level of heterogeneity with multiscale refinement around edges. For RGB images, the most restrictive mesh is employed at every location and scale. In other words, a vertex is tagged for the full image if it is tagged in at least one of the channels.

4. WAVEPOOL: SPATIALLY HETEROGENEOUS POOLING

The proposed spatially heterogeneous pooling, WavePool, is best explained using the wavelet coefficient quadtree representation described in subsection 3.3. One WavePool operation consists of aggregating all the leaf nodes of the wavelet quadtree. In the pixel domain, this step essentially corresponds to merging patches of 2×2 superpixels into a parent superpixel, and aggregating the node features with a choice of pooling function (typically mean or max). Figure 4 illustrates WavePool on a simple superpixel mesh and its effect on both the quadtree (Figure 4, upper panel) and the region adjacency graph (Figure 4, lower panel) representations. In a region adjacency graph (RAG), nodes correspond to superpixels, and edges connect neighboring superpixels. We show the RAG in Figure 4 because we train GNNs to learn from embedded RAGs. A RAG is not a tree and should not be confused with the wavelet coefficient quadtree.

By construction, WavePool generalizes the classical CNN pooling operation. For a regular superpixel grid as in Figure 5, WavePool exactly matches the conventional 2×2 pooling in CNNs. Although more general than its CNN counterpart, WavePool is restricted to WaveMesh superpixels or, more broadly, to any quadtree-based superpixel representation (Tanimoto & Pavlidis, 1975; Zhang et al., 2018), unlike graclus-based pooling. However, graclus-based pooling does not reduce to CNN pooling even when applied to regular superpixel grids, as shown in Figure 5.
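The quadtree tagging, mesh generation, and one WavePool step can be sketched together in plain Python. This is a hedged illustration under stated assumptions rather than the authors' code: a vertex is encoded as a hypothetical triple `(s, i, j)` (scale and block indices), a superpixel as `(row, col, size)`, and all function names are introduced here for illustration only.

```python
def close_under_ancestors(tagged, n_levels):
    """Also tag every ancestor of a tagged vertex (the green vertices)."""
    closed = set(tagged)
    for s, i, j in tagged:
        while s < n_levels:
            s, i, j = s + 1, i // 2, j // 2
            closed.add((s, i, j))
    return closed


def generate_mesh(tagged, n_levels):
    """Recursively split tagged regions; return superpixels (row, col, size)."""
    mesh = []

    def refine(s, i, j):
        if (s, i, j) in tagged:                 # tagged region: split into 2x2
            for di in (0, 1):
                for dj in (0, 1):
                    refine(s - 1, 2 * i + di, 2 * j + dj)
        else:                                   # untagged region: one superpixel
            size = 2 ** s
            mesh.append((i * size, j * size, size))

    refine(n_levels, 0, 0)
    return sorted(mesh)


def wavepool(tagged):
    """One pooling step: remove every leaf vertex of the pruned quadtree."""
    def has_tagged_child(s, i, j):
        return any((s - 1, 2 * i + di, 2 * j + dj) in tagged
                   for di in (0, 1) for dj in (0, 1))
    return {v for v in tagged if has_tagged_child(*v)}


def pool_features(features, pooled_mesh, aggregate=max):
    """Aggregate the features of merged 2x2 superpixels; keep the others."""
    pooled = {}
    for r, c, size in pooled_mesh:
        if (r, c, size) in features:            # superpixel was not merged
            pooled[(r, c, size)] = features[(r, c, size)]
        else:                                   # parent of four merged children
            h = size // 2
            kids = [(r, c, h), (r, c + h, h), (r + h, c, h), (r + h, c + h, h)]
            pooled[(r, c, size)] = aggregate(features[k] for k in kids)
    return pooled


# 8x8 image (3 levels) with a single strong coefficient in the top-left 2x2 patch.
tagged = close_under_ancestors({(1, 0, 0)}, n_levels=3)
mesh = generate_mesh(tagged, 3)                  # 10 multiscale superpixels
coarser = generate_mesh(wavepool(tagged), 3)     # 7 superpixels after one step
```

On a fully refined grid, where every scale-1 vertex is tagged, one `wavepool` step followed by `pool_features` with `max` reproduces ordinary 2×2 max pooling, consistent with the statement that WavePool generalizes CNN pooling.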

5. EXPERIMENTS AND RESULTS

DATASETS. We performed image graph classification experiments on SLIC and WaveMesh superpixels from 3 datasets: MNIST, Fashion-MNIST, and CIFAR-10 (LeCun et al., 1998; Xiao et al., 2017; Krizhevsky et al., 2009) . We represent superpixels by embedded region adjacency graphs (RAG), where nodes correspond to superpixels, and edges connect neighboring superpixels. Node embeddings are mean intensity of superpixels. Edges in the graph are directed with pseudo-coordinates as in Fey et al. (2018) . For more details on the datasets, refer to subsection A.1.
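The region adjacency graph described above can be built from a superpixel label map as in the following sketch. It assumes a single-channel image and a label map of the same shape, uses 4-connectivity to decide which superpixels are neighbors, and omits the pseudo-coordinates of Fey et al. (2018); the function name `build_rag` is illustrative.

```python
def build_rag(image, labels):
    """Build a region adjacency graph from a superpixel label map.

    image  : 2D grid of pixel intensities (nested lists or a 2D array)
    labels : 2D grid of superpixel ids with the same shape as image
    Returns (features, edges): features maps a superpixel id to its mean
    intensity; edges is a sorted list of directed pairs (u, v), with both
    directions stored for every pair of neighboring superpixels.
    """
    sums, counts, edges = {}, {}, set()
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            u = labels[r][c]
            sums[u] = sums.get(u, 0.0) + image[r][c]
            counts[u] = counts.get(u, 0) + 1
            # Look right and down; every 4-connected pixel pair is seen once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols and labels[rr][cc] != u:
                    v = labels[rr][cc]
                    edges.add((u, v))
                    edges.add((v, u))
    features = {u: sums[u] / counts[u] for u in sums}
    return features, sorted(edges)
```

The same function applies to SLIC and WaveMesh label maps alike, since it only relies on which pixels share a label.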

EXPERIMENTAL SETTINGS.

We conduct experiments on two configurations based on a SplineCNN implementation available in PyTorch Geometric (Fey & Lenssen, 2019). The configurations are:

1. SplineConv((3, 3), 1, 32) -> Pool -> SplineConv((3, 3), 32, 64) -> Pool -> Global mean pool -> FC(128) -> FC(10). This configuration has 30506 parameters.
2. SplineConv((3, 3), 1, 32) -> Pool -> SplineConv((3, 3), 32, 64) -> Pool -> SplineConv((3, 3), 64, 128) -> Pool -> Global mean pool -> FC(256) -> FC(10). This configuration has 139178 parameters.

Through the experiments we aim to: 1) compare how SplineCNN performs on SLIC and WaveMesh superpixels under the same network architecture and training settings; 2) understand the effect of WavePool on learning from WaveMesh superpixels, everything else being the same.

The PyTorch Geometric implementation of SplineCNN uses the Adam optimizer with an initial learning rate of 0.01, which is decreased by a factor of 10 after 15 and 25 epochs. Since the goal of our experiments is not to tune the best model for the WaveMesh superpixel representation, we conducted all experiments without any hyperparameter tuning. We use the same training settings and train the network for 30 epochs on MNIST and Fashion-MNIST, and for 75 epochs on CIFAR-10. The pooling function is max for both WavePool and graclus-based pooling. All experiments are repeated 5 times, and the mean train and test accuracy are reported along with the standard deviation. Figure 6 is a visual illustration of a WaveMesh superpixel graph passing through the SplineCNN network with WavePool.

MNIST. Results on the MNIST dataset from our experiments and prior work are shown in Table 1. We report the mean and standard deviation values for train and test accuracy, and precision. We did not include recall since averaged one-versus-all recall and accuracy are equal for balanced datasets. Experiments 1-4 use WaveMesh superpixels obtained using the theoretical threshold T as described in subsection 3.2. Experiments 3 and 4 are the same as 1 and 2, but use a network (config 2) with more parameters. Across these four experiments we first observe that SplineCNN is successful in learning to classify images from the WaveMesh representation. We also observe that the network with WavePool performs better than the one with graclus-based pooling. Experiments 7-10 are the same as 1-4 but with fewer WaveMesh superpixels. These experiments were done to compare with experiments 13-14, which report results from prior work on SLIC superpixels where each image has exactly 75 superpixels. To reduce the number of WaveMesh superpixels in an image to about 75, we increased the threshold above its theoretical value T in our algorithm. From the results for experiments 7-10, we observe that the network with WavePool performs similarly to or better than graclus-based pooling. Experiments 5-6 and 11-12 are on SLIC superpixels that we generated using the scikit-image package. Comparing experiments 1-4 with 5 and 6, and 7-10 with 11 and 12, we observe that SplineCNN learns just as well or better from WaveMesh superpixels.

FASHION-MNIST AND CIFAR-10. Experiments similar to MNIST were performed on these two datasets. Results are reported in Tables 2 and 3. For both datasets, experiments 1-4 were performed on WaveMesh superpixels obtained using the theoretical threshold T in our algorithm.

From our experiments on the three benchmark datasets, we observe that, under the same network architecture and training settings, SplineCNN with the original graclus-based pooling learns from WaveMesh superpixels on par with SLIC superpixels. Additionally, under the same settings, we observe that the best performance is achieved when replacing graclus-based pooling with WavePool while using WaveMesh superpixels. This is shown in Figure 7 for all three datasets. We believe this increase in performance arises because WavePool accounts for spatial heterogeneity while aggregating nodes. Overall, we conclude that WaveMesh is a reasonable sample-specific multiscale superpixeling method.
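As a sanity check on the parameter count quoted for configuration 1 in the experimental settings, the arithmetic can be reproduced with a short sketch. It assumes that each spline convolution holds one in×out weight matrix per kernel basis function (3×3 = 9 here), an additional in×out root weight, and a bias vector, and that each fully connected layer has a bias. These are assumptions about the layer layout, and the helper names are hypothetical.

```python
def spline_conv_params(kernel_size, in_channels, out_channels):
    """Parameter count of one spline convolution under the stated assumptions."""
    n_kernels = 1
    for k in kernel_size:
        n_kernels *= k                      # 3 x 3 kernel -> 9 weight matrices
    weight = n_kernels * in_channels * out_channels
    root = in_channels * out_channels       # assumed root (self) weight
    bias = out_channels
    return weight + root + bias


def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer with bias."""
    return in_features * out_features + out_features


# Configuration 1: two spline convolutions, global mean pool, FC(128), FC(10).
config1 = (
    spline_conv_params((3, 3), 1, 32)
    + spline_conv_params((3, 3), 32, 64)
    + linear_params(64, 128)
    + linear_params(128, 10)
)
print(config1)  # 30506
```

Pooling layers contribute no trainable parameters in this accounting, so they do not appear in the sum. Only configuration 1 is checked here.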

6. CONCLUSION

Over the last 5 years powerful GNNs have been developed for a variety of tasks on graph structured data. Nonetheless, for image graph classification tasks, GNN studies have been restricted to graphs that model a regular grid or similar-sized SLIC superpixel representations. Looking at images through the lens of GNNs enables rethinking the process of downsampling, and offers new possibilities for image representations that explore the landscape between the regular grid and similar-sized superpixel representations. Towards this goal, we introduced WaveMesh, a superpixeling algorithm that computes spatially heterogeneous superpixels of varying sizes within an image. We also proposed WavePool, a new pooling scheme tailored to WaveMesh superpixels. We investigate the performance of both methods across three benchmark datasets. Our experiments comparing WaveMesh superpixels with SLIC superpixels and WavePool with graclus-based pooling demonstrated promising results. Multiscale spatially heterogeneous superpixels warrant further attention. As a future direction, we encourage researchers to benchmark GNN models on WaveMesh superpixels and explore architectures custom to WaveMesh superpixels. In particular, we envision greater interest in this direction of research from the medical machine learning community where high resolution images are ubiquitous (see subsection A.4).

A APPENDIX

A.1 DATASETS

1. MNIST: 28×28 grayscale images, 60k train and 10k test, 10 categories.
2. Fashion-MNIST: 28×28 grayscale images, 60k train and 10k test, 10 categories.
3. CIFAR-10: 32×32 color images, 50k train and 10k test, 10 categories.

SLIC superpixels were generated using the scikit-image library by setting the compactness parameter to 0.25 (van der Walt et al., 2014). A small value of the compactness parameter was chosen to ensure that superpixel shapes are not all square. While computing WaveMesh superpixels, MNIST and Fashion-MNIST images are padded with zeros to make them 32×32. See Figure 8 for examples of region adjacency graphs generated from CIFAR-10 images using WaveMesh superpixels.

MNIST. In Figure 9, the confusion matrices are strongly diagonal for all experiments, i.e., most images are classified correctly. The actual class is along the rows and the predicted class is along the columns. In Figure 9, the error rate matrix for each experiment in the right quadrant is obtained by dividing each value in the confusion matrix by the number of images in the corresponding class, and by filling the main diagonal with zeros.

• SLIC with Graclus. Comparing the error rate matrices of experiments 6 and 12, in both cases many digit 4 images are misclassified as 9, and as the number of superpixels is reduced, many digit 7 images are also misclassified as 9.
• WaveMesh with WavePool. In experiment 4 many digit 4 images are wrongly classified as 9. However, in experiment 10, with the number of superpixels equal to one-fifth of that in experiment 4, many 9s are misclassified as 4 and many 3s are misclassified as 2.

FASHION-MNIST. In Figure 10,

• from the confusion matrices, we can conclude that the model performs best in classifying images from the classes trouser, sandal, sneaker, bag, and ankle boot. This is true with both WaveMesh and SLIC superpixels.
• from the error rate matrices, we can conclude that shirt is misclassified the most. Also, the columns for the classes shirt, coat and pullover are quite bright, indicating that many images are misclassified into these classes.

Overall, from the error analysis for MNIST and Fashion-MNIST, we observe that misclassification patterns are not very different for WaveMesh and SLIC superpixels.
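The error rate matrices used in this error analysis follow directly from the confusion matrices. A minimal sketch, assuming the actual class is along the rows as stated above (the function name `error_rate_matrix` is illustrative):

```python
def error_rate_matrix(confusion):
    """Row-normalize a confusion matrix and zero its main diagonal.

    confusion[i][j] is the number of images of actual class i that were
    predicted as class j. Each entry is divided by the number of images
    in class i, so row i holds the fraction of class i sent to each
    wrong class.
    """
    rates = []
    for i, row in enumerate(confusion):
        n_class = sum(row)
        if n_class == 0:                    # class absent from the test set
            rates.append([0.0] * len(row))
            continue
        rates.append([0.0 if j == i else count / n_class
                      for j, count in enumerate(row)])
    return rates
```

A bright column in such a matrix then indicates a class that attracts misclassified images from several other classes, which is how the shirt, coat and pullover columns are read above.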



Figure 1: (a) Distribution of superpixel size averaged across the MNIST training dataset for different superpixel representations: none (left), WaveMesh (center), and SLIC (right). In each panel, an inset shows the graph representation of a single sample for illustration. The size of a node in the graph is proportional to the superpixel size. SLIC superpixels are not square, yet the x-axis binning is chosen to match the other plots. (b) Boxplots of the number of superpixels per image for the CIFAR-10 training dataset.

Figure 3: Illustration of the wavelet-based quadtree compression algorithm for an 8×8 image, along with the resulting adapted grid. Starting from the coarsest possible wavelet grid that contains just one superpixel, the algorithm adapts the grid by recursively splitting it. If the wavelet coefficient corresponding to a region is tagged (denoted by blue color), then that region is split into 2×2 superpixels.

Figure 4: Illustration of WavePool from wavelet quadtree representation. Leaf nodes (2×2 superpixels) are recursively pooled. In the lower panel, dashed squares and lines correspond to nodes and edges in the region adjacency graph (RAG) representation of the superpixel mesh, respectively.

Figure 5: Illustration of WavePool versus graclus-based pooling on a regular superpixel grid.

Figure 7: Mean test accuracy versus mean number of superpixels for all three datasets and both network configurations. The plot compares the accuracy of the WaveMesh and WavePool combination with that of the SLIC and Graclus combination.

Figure 8: WaveMesh superpixels represented by region adjacency graphs (RAG). (a) Images from the CIFAR-10 dataset. (b) RAGs representing WaveMesh superpixels obtained using the theoretical threshold T for the images shown in (a). (c) RAGs representing WaveMesh superpixels obtained using a threshold equal to 2T. The size of each node in the graph is proportional to the corresponding superpixel size.

Figure 10: Fashion-MNIST error analysis. The top row corresponds to experiments on WaveMesh superpixels with WavePool in config 2. The bottom row corresponds to experiments on SLIC superpixels with Graclus pooling in config 2. The left quadrant shows the confusion matrix and the right quadrant shows the error rate matrix averaged over all runs of an experiment. Experiment numbers from Table 2 are indicated below each matrix.

Figure 11: Error rate matrix for experiment 4 from Table 2. Comparing this matrix with that of experiment 8 from Figure 10, we observe that many more images of pullover and coat are misclassified as shirt in this experiment than in experiment 8.

Table 1: Results on the MNIST dataset. Mean±SD (%) are reported for each evaluation metric.

Table 2: Results on Fashion-MNIST. Mean±SD (%) are reported for each evaluation metric.

In Table 3, experiments 7-9 report results from Dwivedi et al. (2020), where RingGNN and Gated GCN perform the worst and best, respectively. Results for MoNet are shown because SplineCNN builds on the work of MoNet. For the case of Fashion-MNIST, experiments 1-4 with more superpixels have similar train accuracy as experiments 5-8. However, the trained model in experiments 1-4 performed poorly on test data when compared to experiments 5-8. It is unclear why this happened. A more detailed error analysis is included in subsection A.3.

Table 3: Results on CIFAR-10. Experiments 7-9 report the min and max number of nodes in column 3, and the number of parameters in the model in column 4 (Dwivedi et al., 2020).

Under review as a conference paper at ICLR 2021

A.4 WAVEMESH APPLIED TO MEDICAL IMAGES

Figure 12: Left: Chest X-ray image of size 1024×1024. Center: WaveMesh superpixel mesh. Right: Wavelet filtered chest X-ray image.

The chest X-ray image shown in Figure 12 is from the NIH chest X-ray dataset (Wang et al., 2017). The image has 1024×1024 pixels. These X-ray images are typically downsampled to 256×256 before using them for training a CNN model. In the center of Figure 12, we show the WaveMesh superpixel mesh obtained using our algorithm. It has 3843 multiscale superpixels. We note that $3843 < 256^2 < 1024^2$. In fact, the compression ratio is 1 : 17 : 273.
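The compression ratio quoted above follows from simple arithmetic, reproduced in this short sketch (the helper name `compression_ratio` is illustrative):

```python
def compression_ratio(n_superpixels, side):
    """Pixels in a side x side image per WaveMesh superpixel, rounded."""
    return round(side * side / n_superpixels)


n_superpixels = 3843
ratios = (1, compression_ratio(n_superpixels, 256), compression_ratio(n_superpixels, 1024))
print(" : ".join(str(r) for r in ratios))  # 1 : 17 : 273
```

In other words, the WaveMesh representation uses about 17 times fewer elements than the 256×256 downsampled image and about 273 times fewer than the original 1024×1024 image.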

