GRAPH CONTRASTIVE LEARNING FOR SKELETON-BASED ACTION RECOGNITION

Abstract

In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local, since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences. Specifically, SkeletonGCL associates graph learning across sequences by enforcing graphs to be class-discriminative, i.e., intra-class compact and inter-class dispersed, which improves the GCN capacity to distinguish various action patterns. Besides, two memory banks are designed to enrich cross-sequence context from two complementary levels, i.e., instance and semantic levels, enabling graph contrastive learning at multiple context scales. Consequently, SkeletonGCL establishes a new training paradigm, and it can be seamlessly incorporated into current GCNs. Without loss of generality, we combine SkeletonGCL with three GCNs (2S-AGCN, CTR-GCN, and InfoGCN), and achieve consistent improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The source code will be available at https://github.com/OliverHxh/SkeletonGCL.

1. INTRODUCTION

Graph convolutional networks (GCNs) have been widely applied to skeleton-based action recognition since they naturally process non-grid skeleton sequences. For GCN-based methods, how to effectively learn the graphs remains a core and challenging problem. In particular, ST-GCN (Yan et al., 2018) is a milestone work that uses pre-defined graphs to extract action patterns. However, pre-defined graphs allow each joint to access only a fixed set of neighboring joints and can hardly capture long-range dependencies adaptively. Therefore, many subsequent works (Li et al., 2019; Shi et al., 2019; Zhang et al., 2020b;a; Ye et al., 2020; Chen et al., 2021b; Chi et al., 2022) address this issue by generating adaptive graphs. Adaptive graphs can dynamically aggregate features within each sequence and thus show significant performance advantages. Generally, adaptive graphs are constructed using intra-sequence context. However, such context is still "local" when cross-sequence information is viewed as an available context. Therefore, we wonder: is it possible to involve cross-sequence context in graph learning? To find the answer, in Fig. 1 we visualize the adaptive graphs learned from sequences of two easily confused classes ("point to something" and "take a selfie"). The graphs are learned by a strong GCN, i.e., CTR-GCN (Chen et al., 2021b). We make two observations: (1) For correctly classified sequences in Fig. 1 (a) and (b), the graphs learned from the same class share similar patterns, while those from different classes show distinct patterns. (2) For the misclassified sequence in Fig. 1 (c), the learned graph resembles the graphs from the misclassified class more than those from the ground-truth class. These observations remind us that graph learning in current adaptive GCNs can implicitly learn class-specific graph representations to some extent. But without explicit constraints, it produces class-ambiguous representations in some cases, thereby hurting the GCN's capacity to discriminate classes (in Tab. 9 of Sec. 4.4, we provide quantitative results that further support this hypothesis).
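The adaptive graph construction discussed above can be sketched as follows, in the style of 2S-AGCN: a per-sequence adjacency is computed from pairwise joint-feature similarity. This is a minimal illustration, not the original implementation; the module and parameter names (`AdaptiveGraph`, `embed_channels`) are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraph(nn.Module):
    """Sketch of an adaptive graph module: a data-dependent adjacency is
    built from pairwise joint-feature similarity within one sequence."""
    def __init__(self, in_channels, embed_channels=16):
        super().__init__()
        # two 1x1 embeddings, as in attention-style graph inference
        self.theta = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, embed_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        a = self.theta(x).mean(dim=2)              # (N, Ce, V), pool over time
        b = self.phi(x).mean(dim=2)                # (N, Ce, V)
        # pairwise similarity between joints -> per-sequence adjacency
        adj = torch.einsum('ncu,ncv->nuv', a, b)   # (N, V, V)
        return F.softmax(adj, dim=-1)              # normalize each row
```

Because the adjacency depends only on the input sequence itself, the context it encodes is intra-sequence, which is exactly the "local" limitation the paper targets.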
Therefore, we speculate that if cross-sequence semantic relations are incorporated as guidance in graph learning, the class-ambiguity issue will be alleviated and the graph representations will better express the individual characteristics of actions. In recent years, contrastive learning has achieved great success in self-supervised representation learning (He et al., 2020; Chen et al., 2020; 2021a). It studies cross-sample relations in the dataset. The essence of contrastive learning is "comparing": it pulls together the feature embeddings of positive pairs and pushes apart the feature embeddings of negative pairs. Based on the analysis above and the advances in contrastive learning, we propose a graph contrastive learning framework for skeleton-based action recognition in the fully-supervised setting, dubbed SkeletonGCL. Instead of using only the local information within each sequence, SkeletonGCL explores the cross-sequence global context to improve graph learning. The core idea is to pull together the learned graphs from the same class while pushing away the learned graphs from different classes. Since graphs reveal the action patterns of sequences, enforcing graph consistency within the same class and inconsistency among different classes helps the model understand various motion modes. In addition, to enrich the cross-sequence context, we build memory banks to store the graphs from historical sequences. Specifically, an instance-level memory bank stores sequence-wise graphs, which preserve the individual properties of each sequence. In contrast, a semantic-level memory bank stores the aggregation of graphs from each class, which contains the class-level representation. The two banks are complementary to each other, enabling us to leverage more samples. SkeletonGCL can be seamlessly combined with existing GCNs. Eventually, we combine SkeletonGCL with three GCNs (2S-AGCN (Shi et al., 2019), CTR-GCN (Chen et al., 2021b), and InfoGCN (Chi et al., 2022)), and conduct experiments on three popular datasets (NTU60 (Shahroudy et al., 2016), NTU120 (Liu et al., 2019), and NW-UCLA (Wang et al., 2014)). SkeletonGCL achieves consistent improvements with these models under different testing protocols (single-modal or multi-modal) on all three datasets, which widely demonstrates the effectiveness of our design. Notably, SkeletonGCL introduces only a small amount of training overhead and has no impact at the test stage.

* Work done when Xiaohu Huang was an intern at Baidu VIS. † Corresponding authors.

Figure 1: Graph visualization of sequences from two easily confused classes ("point to something" and "take a selfie"). The graphs are learned by CTR-GCN (Chen et al., 2021b). We take the tip of the acting hand as the anchor. The size of the red circles and the width of the blue lines both denote the strengths of connections between joints. For simplicity, only representative frames are visualized. (a) Three sequences from class "point to something" are correctly classified, where the graphs contain connections to the body joints. (b) Three sequences from class "take a selfie" are correctly classified, where the graphs highly emphasize the connections to the hands while the connections to the body are suppressed. (c) A sequence from class "point to something" is misclassified as "take a selfie"; its graph resembles the graphs in (b) but is dissimilar from the graphs in (a). Hence, class-ambiguous graph representations can negatively impact recognition performance.
