Graph Convolution with Low-rank Learnable Local Filters

Abstract

Geometric variations such as rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework for handling such non-Euclidean data, but many of them, particularly those based on global graph Laplacians, lack the expressiveness to capture the local features required to represent signals lying on the non-Euclidean grid. This paper introduces a new type of graph convolution with learnable low-rank local filters, which is provably more expressive than previous spectral graph convolution methods. The model also provides a unified framework for both spectral and spatial graph convolutions. To improve model robustness, regularization by local graph Laplacians is introduced. The stability of the representation against perturbations of the input graph data is proved theoretically, making use of the locality of the graph filters and the local graph regularization. Experiments on spherical mesh data, real-world facial expression recognition and skeleton-based action recognition data, and data with simulated graph noise show the empirical advantage of the proposed model.

1. Introduction

Deep methods have achieved great success in visual cognition, yet they still lack the capability to handle severe geometric transformations such as rotation, scaling, and viewpoint changes. This problem is often addressed by data augmentation that includes these geometric variations, e.g., by randomly rotating images, so that the trained model becomes robust to them. However, this substantially increases training time and the number of model parameters. Another approach is to make use of certain underlying structures of objects, e.g., facial landmarks (Chen et al., 2013) and human skeleton landmarks (Vemulapalli et al., 2014a), cf. Fig. 1 (right). Nevertheless, these methods then adopt hand-crafted features based on landmarks, which greatly constrains their ability to obtain rich features for downstream tasks. One of the main obstacles to feature extraction is the non-Euclidean nature of the underlying structures, which in particular prohibits the direct use of prevalent convolutional neural network (CNN) architectures (He et al., 2016; Huang et al., 2017). Although recent CNN models have been designed for non-Euclidean grids, e.g., for spherical meshes (Jiang et al., 2019; Cohen et al., 2018; Coors et al., 2018) and manifold meshes in computer graphics (Bronstein et al., 2017; Fey et al., 2018), they mainly rely on partial differential operators that can be computed precisely only on fine, regular meshes, and may not be applicable to landmarks, which are irregular and coarse. Recent works have also applied Graph Neural Network (GNN) approaches to coarse non-Euclidean data, yet methods using GCN (Kipf & Welling, 2016) may fall short in model capacity, and other methods adopting GAT (Veličković et al., 2017) are mostly heuristic and lack theoretical analysis. A detailed review is provided in Sec. 1.1. In this paper, we propose a graph convolution model, called L3Net, originating from a low-rank graph filter decomposition, cf. Fig. 1 (left).
The model provides a unified framework

