PSTNET: POINT SPATIO-TEMPORAL CONVOLUTION ON POINT CLOUD SEQUENCES

Abstract

Point cloud sequences are irregular and unordered in the spatial dimension while exhibiting regularities and order in the temporal dimension. Therefore, existing grid based convolutions for conventional video processing cannot be directly applied to spatio-temporal modeling of raw point cloud sequences. In this paper, we propose a point spatio-temporal (PST) convolution to achieve informative representations of point cloud sequences. The proposed PST convolution first disentangles space and time in point cloud sequences. Then, a spatial convolution is employed to capture the local structure of points in the 3D space, and a temporal convolution is used to model the dynamics of the spatial regions along the time dimension. Furthermore, we incorporate the proposed PST convolution into a deep network, namely PSTNet, to extract features of point cloud sequences in a hierarchical manner. Extensive experiments on widely-used 3D action recognition and 4D semantic segmentation datasets demonstrate the effectiveness of PSTNet in modeling point cloud sequences.

1. INTRODUCTION

Modern robotic and autonomous driving systems usually employ real-time depth sensors, such as LiDAR, to capture the geometric information of scenes accurately while remaining robust to different lighting conditions. A scene's geometry is thus represented by a 3D point cloud, i.e., a set of measured point coordinates {(x, y, z)}. Moreover, when RGB images are available, they are often used as additional features associated with the 3D points to enhance the discriminativeness of point clouds. However, unlike conventional grid based videos, dynamic point clouds are irregular and unordered in the spatial dimension, while points are not consistent across frames and may even flow in and out over time. Therefore, existing 3D convolutions on grid based videos (Tran et al., 2015; Carreira & Zisserman, 2017; Hara et al., 2018) are not suitable for modeling raw point cloud sequences, as shown in Fig. 1. To model the dynamics of point clouds, one solution is to convert them into a sequence of 3D voxels and then apply 4D convolutions (Choy et al., 2019) to the voxel sequence. However, directly performing convolutions on voxel sequences requires a large amount of computation. Furthermore, quantization errors are inevitable during voxelization, which may restrict applications that require precise measurement of scene geometry. Another solution, MeteorNet (Liu et al., 2019e), extends the static point cloud method PointNet++ (Qi et al., 2017b) to process raw point cloud sequences by appending a 1D temporal dimension to the 3D points. However, simply concatenating coordinates and time and treating point cloud sequences as unordered 4D point sets neglects the temporal order of timestamps, which may not properly exploit the temporal information and can lead to inferior performance. Moreover, the scales of spatial displacements and temporal differences in point cloud sequences may not be compatible, so treating them equally is not conducive to network optimization.
Besides, MeteorNet only considers spatial neighbors and neglects the local dependencies between neighboring frames. Because it uses the whole sequence length as its temporal receptive field, MeteorNet cannot construct a temporal hierarchy. As points are not consistent and may even flow in and out of a region, especially for long sequences and fast motion, embedding points in a spatially local area along an entire sequence handicaps capturing accurate local dynamics of point clouds.

In this paper, we propose a point spatio-temporal (PST) convolution to directly process raw point cloud sequences. As dynamic point clouds are spatially irregular but ordered in the temporal dimension, we decouple the spatial and temporal information to model point cloud sequences. Specifically, PST convolution consists of (i) a point based spatial convolution that models the spatial structure of 3D points and (ii) a temporal convolution that captures the temporal dynamics of point cloud sequences. In this fashion, PST convolution significantly facilitates the modeling of dynamic point clouds and reduces the impact of the spatial irregularity of points on temporal modeling. Because points emerge inconsistently across frames, it is challenging to perform convolution on them. To address this problem, we introduce a point tube to preserve the spatio-temporal local structure. To enhance the feature extraction ability, we incorporate the proposed PST convolution into a spatio-temporally hierarchical network, namely PSTNet. Moreover, we extend our PST convolution to a transposed version to address point-level prediction tasks. Different from the convolutional version, the PST transposed convolution is designed to interpolate temporal dynamics and spatial features. Extensive experiments on widely-used 3D action recognition and 4D semantic segmentation datasets demonstrate the effectiveness of the proposed PST convolution and the superiority of PSTNet in modeling point cloud sequences.
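The space-time decoupling described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel forms, function names, and the radius-based neighborhood query are simplifying assumptions chosen only to show the two-stage structure, i.e., a shared point based spatial convolution applied within each frame, followed by a 1D temporal convolution across the resulting per-frame features.

```python
import numpy as np

def spatial_conv(points, feats, anchors, radius, W_disp, W_feat):
    """Point based spatial convolution within one frame (simplified sketch).

    points:  (N, 3) point coordinates of one frame
    feats:   (N, C) input point features
    anchors: (M, 3) anchor (subsampled) coordinates
    W_disp:  (3, C_out) weight applied to local displacements (assumed form)
    W_feat:  (C, C_out) weight applied to neighbor features (assumed form)
    Returns (M, C_out): one aggregated feature per anchor.
    """
    M, C_out = anchors.shape[0], W_feat.shape[1]
    out = np.zeros((M, C_out))
    for m, a in enumerate(anchors):
        d = points - a                                  # displacements to the anchor
        mask = np.linalg.norm(d, axis=1) < radius       # ball-query neighborhood
        if mask.any():
            # kernel conditioned on displacement, plus transformed neighbor features
            out[m] = np.maximum(d[mask] @ W_disp + feats[mask] @ W_feat, 0.0).sum(0)
    return out

def temporal_conv(seq_feats, W_t):
    """1D temporal convolution over per-frame anchor features.

    seq_feats: (L, M, C_out) spatial features for L frames
    W_t:       (K, C_out, C_out2) temporal kernel of size K ('valid', stride 1)
    Returns (L - K + 1, M, C_out2).
    """
    L, K = seq_feats.shape[0], W_t.shape[0]
    return np.stack([
        sum(seq_feats[t + k] @ W_t[k] for k in range(K))
        for t in range(L - K + 1)
    ])
```

Because the temporal convolution operates on per-region features rather than on raw coordinates, the spatial irregularity of the points never enters the temporal stage, which is the intuition behind the decoupling.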
The contributions of this paper are fourfold:
• To the best of our knowledge, we make the first attempt to decompose spatial and temporal information in modeling raw point cloud sequences, and propose a generic point based convolutional operation, named PST convolution, to encode raw point cloud sequences.
• We propose a PST transposed convolution to decode raw point cloud sequences by interpolating temporal dynamics and spatial features for point-level prediction tasks.
• We construct convolutional neural networks based on the PST convolutions and transposed convolutions, dubbed PSTNet, to tackle sequence-level classification and point-level prediction tasks. To the best of our knowledge, PSTNet is the first deep neural network to model raw point cloud sequences in a both spatially and temporally hierarchical manner.
• Extensive experiments on four datasets show that our method improves the accuracy of 3D action recognition and 4D semantic segmentation.

2. RELATED WORK

Learning Representations on Grid based Videos. Impressive progress has been made on generating compact and discriminative representations for RGB/RGBD videos due to the success of deep neural networks. For example, two-stream convolutional neural networks (Simonyan & Zisserman, 2014; Wang et al., 2016) utilize a spatial stream and an optical flow stream for video modeling. To summarize the temporal dependencies of videos, recurrent neural networks (Ng et al., 2015; Fan et al., 2018) and pooling techniques (Fan et al., 2017) are employed. In addition, by stacking multiple 2D frames into a 3D tensor, 3D convolutional neural networks (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018; Hara et al., 2018) are widely used to learn spatio-temporal representations for videos, and achieve promising performance. Besides, interpretable video or action reasoning methods (Zhuo et al., 2019; Fan et al., 2021b) have been proposed that explicitly parse changes in videos.



Figure 1: Illustration of grid based and point based convolutions on spatio-temporal sequences. (a) For a grid based video, each grid cell represents the feature of a pixel, where C, L, H and W denote the feature dimension, the number of frames, height and width, respectively. A 3D convolution encodes the input to an output of size C′ × L′ × H′ × W′. (b) A point cloud sequence consists of a coordinate part (3 × L × N) and a feature part (C × L × N), where N indicates the number of points in a frame. Our PST convolution encodes the input to an output composed of a coordinate tensor (3 × L′ × N′) and a feature tensor (C′ × L′ × N′). Usually, L′ ≤ L and N′ ≤ N so that networks can model point cloud sequences in a spatio-temporally hierarchical manner. Note that points in different frames are not consistent, and thus it is challenging to capture the spatio-temporal correlation.
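The spatial subsampling implied by producing fewer output points per frame is commonly realized in point based networks (e.g., PointNet++) with farthest point sampling. The helper below is a hypothetical NumPy sketch of that standard greedy procedure, not code from the paper; it picks anchors that spread over the cloud, which is what lets each deeper layer cover larger spatial regions with fewer points.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily select m well-spread anchor points (standard FPS sketch).

    points: (N, 3) coordinates of one frame
    m:      number of anchors to keep (the reduced point count per frame)
    Returns the indices of the m chosen points.
    """
    N = points.shape[0]
    chosen = [0]                       # start from an arbitrary seed point
    dist = np.full(N, np.inf)          # distance of each point to the chosen set
    for _ in range(m - 1):
        # update distances with the most recently chosen anchor
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        # pick the point farthest from all anchors chosen so far
        chosen.append(int(dist.argmax()))
    return np.array(chosen)
```

Temporal subsampling, by contrast, needs no such search: because frames are ordered, a strided temporal convolution over the frame axis reduces the frame count directly.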

