SPATIO-TEMPORAL GRAPH SCATTERING TRANSFORM

Abstract

Although spatio-temporal graph neural networks have achieved great empirical success in handling multiple correlated time series, they may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data. Furthermore, spatio-temporal graph neural networks lack theoretical interpretation. To address these issues, we put forth a novel mathematically designed framework to analyze spatio-temporal data. Our proposed spatio-temporal graph scattering transform (ST-GST) extends traditional scattering transforms to the spatio-temporal domain. It iteratively applies spatio-temporal graph wavelets and nonlinear activation functions, which can be viewed as a forward pass of a spatio-temporal graph convolutional network without training. Since all the filter coefficients in ST-GST are mathematically designed, it is promising for real-world scenarios with limited training data, and it also allows for a theoretical analysis, which shows that the proposed ST-GST is stable to small perturbations of input signals and structures. Finally, our experiments show that i) ST-GST outperforms spatio-temporal graph convolutional networks by 35% in accuracy on the MSR Action3D dataset; ii) it is better and computationally more efficient to design the transform based on separable spatio-temporal graphs than on joint ones; and iii) the nonlinearity in ST-GST is critical to empirical performance.

1. INTRODUCTION

Processing and learning from spatio-temporal data have received increasing attention recently. Examples include: i) skeleton-based human action recognition based on a sequence of human poses (Liu et al. (2019)), which is critical to human behavior understanding (Borges et al. (2013)), and ii) multi-agent trajectory prediction (Hu et al. (2020)), which is critical to robotics and autonomous driving (Shalev-Shwartz et al. (2016)). A common pattern across these applications is that data evolves in both spatial and temporal domains. This paper aims to analyze this type of data by developing novel spatio-temporal graph-based data modeling and operations.

Spatio-temporal graph-based data modeling. Graphs are often used to model data where irregularly spaced samples are observed over time. Good spatio-temporal graphs can provide informative priors that reflect the internal relationships within data. For example, in skeleton-based human action recognition, we can model a sequence of 3D joint locations as data supported on skeleton graphs across time, which reflects both the human physical constraints and temporal consistency (Yan et al. (2018)). Recent studies on modeling spatio-temporal graphs have followed either joint or separable processing frameworks. Joint processing is based on constructing a single spatio-temporal graph and processing (e.g., filtering) via operations on this graph (Kao et al. (2019); Liu et al. (2020)). In contrast, a separable processing approach works separately, and possibly with different operators, across the space and time dimensions. In this case, independent graphs are used for space and time (Yan et al. (2018); Cheng et al. (2020)). However, no previous work thoroughly analyzes and compares these two constructions. In this work, we mathematically study these two types of graphs and justify the benefits of separable processing from both theoretical and empirical aspects.

Spatio-temporal graph-based operations.
Graph operations can be performed once the graph structure is given. Some commonly used graph operations include the graph Fourier transform (Shuman et al. (2013)) and graph wavelets (Hammond et al. (2011)). It is possible to extend those operations to the spatio-temporal graph domain. For example, Grassi et al. (2017) developed the short time-vertex Fourier transform and spectrum-based time-vertex wavelet transform. However, those mathematically designed, linear operations show some limitations in terms of empirical performance. In comparison, many recent deep neural networks adopt trainable graph convolution operations to analyze spatio-temporal data (Yan et al. (2018); Liu et al. (2020)). However, most networks are designed through trial and error. It is thus hard to explain the rationale behind their empirical success and to further improve the designs (Monga et al. (2019)). In this work, to fill in the gap between mathematically designed linear transforms and trainable spatio-temporal graph neural networks, we propose a novel spatio-temporal graph scattering transform (ST-GST), which is a mathematically designed, nonlinear operation. Specifically, to characterize the spatial and temporal dependencies, we present two types of graphs corresponding to joint and separable designs. We then construct spatio-temporal graph wavelets based on each of these types of graphs. We next propose the framework of ST-GST, which adopts spatio-temporal graph wavelets followed by a nonlinear activation function as a single scattering layer. All the filter coefficients in ST-GST are mathematically designed beforehand and no training is required. We further show that i) a design based on separable spatio-temporal graphs is more flexible and computationally efficient than a joint design; and ii) ST-GST is stable to small perturbations on both input spatio-temporal graph signals and structures. Finally, our experiments on skeleton-based human action recognition show that the proposed ST-GST outperforms spatio-temporal graph convolutional networks by 35% in accuracy on the MSR Action3D dataset.

We summarize the main contributions of this work as follows:

• We propose wavelets for both separable and joint spatio-temporal graphs. We show that it is more flexible and computationally efficient to design wavelets based on separable spatio-temporal graphs;
• We propose a novel spatio-temporal graph scattering transform (ST-GST), which is a non-trainable counterpart of spatio-temporal graph convolutional networks and a nonlinear version of spatio-temporal graph wavelets. We also theoretically show that ST-GST is robust and stable in the presence of small perturbations on both input spatio-temporal graph signals and structures;
• For skeleton-based human action recognition, our experiments show that: i) ST-GST can achieve similar or better performance than spatio-temporal graph convolutional networks and other non-deep-learning approaches on small-scale datasets; ii) separable spatio-temporal scattering works significantly better than joint spatio-temporal scattering; and iii) ST-GST significantly outperforms spatio-temporal graph wavelets because of the nonlinear activation function.

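To make the idea of a scattering layer concrete, the following is a minimal sketch of a separable spatio-temporal scattering transform, not the exact construction used in this paper. Several choices are assumptions made only for illustration: the filter bank is a dyadic diffusion-style wavelet bank built from a lazy random walk, the nonlinearity is the absolute value, each scattering output is summarized by its mean, the temporal graph is an undirected path, and the 5-node "skeleton" is a toy graph. The function names (`path_adjacency`, `diffusion_wavelets`, `st_scattering`) are hypothetical.

```python
import numpy as np


def path_adjacency(n):
    """Adjacency of an undirected path graph with n nodes (assumed temporal graph)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A


def diffusion_wavelets(A, num_scales):
    """Assumed filter bank from the lazy random walk P = (I + A D^-1) / 2.

    H_0 = I - P and H_j = P^(2^(j-1)) - P^(2^j) for j >= 1.
    Requires every node to have at least one neighbor.
    """
    n = A.shape[0]
    deg = A.sum(axis=0)
    P = 0.5 * (np.eye(n) + A / deg)  # A / deg scales column j by 1 / deg[j]
    filters = [np.eye(n) - P]
    power = P  # holds P^(2^(j-1)), starting from P^1
    for _ in range(1, num_scales):
        next_power = power @ power
        filters.append(power - next_power)
        power = next_power
    return filters


def st_scattering(X, A_space, A_time, scales_s=3, scales_t=3, num_layers=2):
    """Separable scattering of X with shape (N nodes, T time steps).

    Each layer applies every (spatial, temporal) wavelet pair followed by an
    absolute value. Returns one averaged feature per node of the scattering tree.
    """
    Hs = diffusion_wavelets(A_space, scales_s)
    Ht = diffusion_wavelets(A_time, scales_t)
    layer = [X]
    features = [X.mean()]
    for _ in range(num_layers):
        next_layer = []
        for Z in layer:
            for hs in Hs:
                for ht in Ht:
                    # Spatial filter acts on rows, temporal filter on columns.
                    out = np.abs(hs @ Z @ ht.T)
                    next_layer.append(out)
                    features.append(out.mean())
        layer = next_layer
    return np.array(features)


if __name__ == "__main__":
    # Toy 5-joint "skeleton" (hypothetical) observed over 8 frames.
    A_s = np.zeros((5, 5))
    for i, j in [(0, 1), (1, 2), (1, 3), (1, 4)]:
        A_s[i, j] = A_s[j, i] = 1.0
    A_t = path_adjacency(8)
    X = np.random.default_rng(0).standard_normal((5, 8))
    print(st_scattering(X, A_s, A_t).shape)  # (91,) = 1 + 9 + 81 tree nodes
```

In this sketch no coefficient is learned: the filters depend only on the two graphs. The separable form `hs @ Z @ ht.T` gives the same result as multiplying the column-stacked signal by the Kronecker product of `ht` and `hs`, but it never forms an NT × NT matrix, which is one way to see why separable filtering can be cheaper than filtering on a single large graph.
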
Scattering transforms. Convolutional neural networks (CNNs) use nonlinearities coupled with trained filter coefficients and are well known to be hard to analyze theoretically (Anthony & Bartlett (2009)). As an alternative, Mallat (2012) and Bruna & Mallat (2013) propose scattering transforms, which are non-trainable versions of CNNs. Under admissible conditions, the resulting transform enjoys both great performance in image classification and appealing theoretical properties. These ideas have been extended to the graph domain (Gama et al. (2019a); Zou & Lerman (2020); Gao et al. (2019); Ioannidis et al. (2020)). Specifically, the graph scattering transform (GST) proposed in Gama et al. (2019a) iteratively applies predefined graph filter banks and an element-wise nonlinear activation function. In this work, we extend the classical scattering transform to the spatio-temporal domain and provide a new mathematically designed transform to handle spatio-temporal data. The key difference between GST and our proposed ST-GST lies in the graph filter bank design, where ST-GST needs to handle both spatial and temporal domains.

Spatio-temporal neural networks. Deep neural networks have been adapted to operate on the spatio-temporal domain. For example, Liu et al. (2019) use LSTM to process time series information, while ST-GCN (Yan et al. (2018)) combines a graph convolution layer and a temporal convolution

